REMARKS



Attached hereto is a marked-up version of the changes made to the claims by the current 
amendment. The attached page is captioned "Version with markings to show changes made."

Applicants reserve the right to prosecute non-elected subject matter in subsequent divisional 
applications. 

I. Restriction requirement/election

Election, with traverse, of the claims of Group II (encompassing claims 3-7, 9, 10, 12, 13, 46, 
48, 57, and 58), directed to polynucleotides, vectors, host cells, microarrays, and methods of using the 
polynucleotides to produce the encoded polypeptides, is acknowledged. Applicants thank the 
Examiner for acknowledging that, upon allowance of the product claims, process claims commensurate 
in scope with the allowed product claims will be rejoined. 

II. Information disclosure statement 

The Office Action indicates that "citations 24 and 25 fail to comply with the requirements for an 
IDS" because "citations 24 and 25 fail to identify a database, e.g., GenBank or EMBL, from which the 
sequences can be accessed" (Office Action, September 20, 2002; page 3). Applicants submit 
herewith a Supplemental Information Disclosure Statement which indicates that these references can be 
accessed from the publicly available NCBI (National Center for Biotechnology Information) database. 

III. Title of the application 

As suggested by the Examiner, the title of the application has been amended to more clearly 
indicate the subject matter of the claimed invention. 

IV. Claim objections 

Claims 3, 12, 13, 57, and 58 have been amended such that they recite "naturally-occurring," as 
suggested by the Examiner. Withdrawal of this claim objection is therefore requested. 
V. Utility rejection under 35 U.S.C. § 101

The rejection of claims 3-7, 9, 10, 12, 13, 46, 48, 57, and 58 is improper, as the 
inventions of those claims have a patentable utility as set forth in the instant specification, 
and/or a utility well-known to one of ordinary skill in the art. 

The invention at issue is a polynucleotide sequence corresponding to a gene that is expressed in 
human brain tumor tissue (Specification, e.g., at page 10, lines 23-27). The claimed polynucleotide 
encodes a polypeptide demonstrated in the patent specification to be a member of the phosphate 
transporter family, whose biological functions include regulation of intracellular phosphate levels (e.g., at 
page 1, lines 27-31; page 10, line 29 to page 11, line 1; and Figures 2A and 2B). As such, the claimed
invention has numerous practical, beneficial uses in toxicology testing, drug development, and the 
diagnosis of disease, none of which require knowledge of how the polypeptide encoded by the claimed 
polynucleotide actually functions. As a result of the benefits of these uses, the claimed invention already 
enjoys significant commercial success. 

Applicants submit with this response the declaration of Dr. Tod Bedilion describing some of the 
practical uses of the claimed invention in gene and protein expression monitoring applications. The 
Bedilion Declaration demonstrates that the positions and arguments made by the Office Action with 
respect to the utility of the claimed polynucleotide are without merit. 

The Bedilion Declaration describes, in particular, how the claimed expressed polynucleotides
can be used in gene expression monitoring applications that were well-known at the time the patent
application was filed, and how those applications are useful in developing drugs and monitoring their
activity. Dr. Bedilion states that the claimed invention is a useful tool when employed as highly specific
probes in a cDNA microarray:

Persons skilled in the art would [on February 24, 1997] appreciate that a cDNA 
microarray that contained the SEQ ID NO:1-encoding polynucleotides would be a
more useful tool than a cDNA microarray that did not contain any of these
polynucleotides, in connection with conducting gene expression monitoring studies on
proposed (or actual) drugs for disorders associated with increased or decreased
phosphate levels for such purposes as evaluating their efficacy and toxicity. (Bedilion
Declaration, ¶ 15)

The Office Action does not dispute that the claimed polynucleotides can be used as probes in 
cDNA microarrays and used in gene expression monitoring applications. Instead, the Office Action 
contends that the claimed polynucleotides cannot be useful without precise knowledge of their
biological functions, or the biological functions of their encoded polypeptide. But the law never has 
required knowledge of biological function to prove utility. It is the claimed invention's uses, not its 
functions, that are the subject of a proper analysis under the utility requirement. 

In any event, as demonstrated by the Bedilion Declaration, the person of ordinary skill in the art 
can achieve beneficial results from the claimed polynucleotides in the absence of any knowledge as to 
the precise function of the protein encoded by them. The uses of the claimed polynucleotides in gene 
expression monitoring applications are in fact independent of their precise biological functions. 

The Office Action alleges that the asserted utility of the claimed arrays and microarrays in the 
diagnosis of disorders is not substantial or specific because "the specification provides no information 
linking the polynucleotide of SEQ ID NO:2 or the polypeptide of SEQ ID NO:1 to any specific
disease state" (Office Action, September 20, 2002; page 5; emphasis in original). This is incorrect. All 
polynucleotides expressed in humans have utility in toxicology testing based on the property of being 
expressed at some time in development or in the cell life cycle, and this basis for utility does not 
preclude that utility from being specific, substantial, and credible. A toxicology test using an array or 
microarray to detect any particular expressed polynucleotide is dependent on the identity of that 
polynucleotide, not on its biological function or its disease association. The results obtained from using 
an array or microarray to detect any particular human-expressed polynucleotide in toxicology testing are
specific to both the compound being tested and the polynucleotide detected in the test. No two
human-expressed polynucleotides are interchangeable for toxicology testing because the effects on the
expression of any two such polynucleotides will differ depending on the identity of the compound tested 
and the identities of the two polynucleotides. It is not necessary to know the biological functions and 
disease associations of the polynucleotides detected by the array or microarray in order to carry out 
such toxicology tests. Therefore, it is not necessary to provide "information linking the polynucleotide 
of SEQ ID NO:2 or the polypeptide of SEQ ID NO:1 to any specific disease state" for the claimed
arrays and microarrays to have a specific and substantial utility in toxicology testing. At the very least, 
the claimed arrays and microarrays capable of detecting polynucleotides expressed in humans are 
specific controls for toxicology tests in developing drugs targeted to other polynucleotides, and are 
clearly useful as such. 
I. The Applicable Legal Standard

To meet the utility requirement of sections 101 and 112 of the Patent Act, the patent applicant
need only show that the claimed invention is "practically useful," Anderson v. Natta, 480 F.2d 1392,
1397, 178 USPQ 458 (CCPA 1973) and confers a "specific benefit" on the public. Brenner v.
Manson, 383 U.S. 519, 534-35, 148 USPQ 689 (1966). As discussed in a recent Court of Appeals
for the Federal Circuit case, this threshold is not high:

An invention is "useful" under section 101 if it is capable of providing some identifiable 
benefit. See Brenner v. Manson, 383 U.S. 519, 534 [148 USPQ 689] (1966); 
Brooktree Corp, v. Advanced Micro Devices, Inc., 977 F.2d 1555, 1571 [24 
USPQ2d 1401] (Fed. Cir. 1992) ("to violate Section 101 the claimed device must be
totally incapable of achieving a useful result"); Fuller v. Berger, 120 F. 274, 275 (7th 
Cir. 1903) (test for utility is whether invention "is incapable of serving any beneficial 
end"). 

Juicy Whip Inc. v. Orange Bang Inc., 51 USPQ2d 1700 (Fed. Cir. 1999). 

While an asserted utility must be described with specificity, the patent applicant need not
demonstrate utility to a certainty. In Stiftung v. Renishaw PLC, 945 F.2d 1173, 1180, 20 USPQ2d
1094 (Fed. Cir. 1991), the United States Court of Appeals for the Federal Circuit explained:

An invention need not be the best or only way to accomplish a certain result, and it 
need only be useful to some extent and in certain applications: "[T]he fact that an 
invention has only limited utility and is only operable in certain applications is not 
grounds for finding lack of utility." Envirotech Corp. v. Al George, Inc., 730 F.2d 
753, 762, 221 USPQ 473, 480 (Fed. Cir. 1984). 

The specificity requirement is not, therefore, an onerous one. If the asserted utility is described 
so that a person of ordinary skill in the art would understand how to use the claimed invention, it is 
sufficiently specific. See Standard Oil Co. v. Montedison, S.p.a., 212 U.S.P.Q. 327, 343 (3d Cir. 
1981). The specificity requirement is met unless the asserted utility amounts to a "nebulous expression" 
such as "biological activity" or "biological properties" that does not convey meaningful information 
about the utility of what is being claimed. Cross v. Iizuka, 753 F.2d 1040, 1048 (Fed. Cir. 1985). 

In addition to conferring a specific benefit on the public, the benefit must also be "substantial." 
Brenner, 383 U.S. at 534. A "substantial" utility is a practical, "real-world" utility. Nelson v. Bowler, 
626 F.2d 853, 856, 206 USPQ 881 (CCPA 1980). 
If persons of ordinary skill in the art would understand that there is a "well-established" utility
for the claimed invention, the threshold is met automatically and the applicant need not make any 
showing to demonstrate utility. Manual of Patent Examination Procedure at § 706.03(a). Only if there 
is no "well-established" utility for the claimed invention must the applicant demonstrate the practical 
benefits of the invention. Id. 

Once the patent applicant identifies a specific utility, the claimed invention is presumed to 
possess it. In re Cortright, 165 F.3d 1353, 1357, 49 USPQ2d 1464 (Fed. Cir. 1999); In re Brana,
51 F.3d 1560, 1566; 34 USPQ2d 1436 (Fed. Cir. 1995). In that case, the Patent Office bears the 
burden of demonstrating that a person of ordinary skill in the art would reasonably doubt that the 
asserted utility could be achieved by the claimed invention. Id. To do so, the Patent Office must 
provide evidence or sound scientific reasoning. See In re Langer, 503 F.2d 1380, 1391-92, 183
USPQ 288 (CCPA 1974). If and only if the Patent Office makes such a showing, the burden shifts to 
the applicant to provide rebuttal evidence that would convince the person of ordinary skill that there is 
sufficient proof of utility. Brana, 51 F.3d at 1566. The applicant need only prove a "substantial 
likelihood" of utility; certainty is not required. Brenner, 383 U.S. at 532. 

II. Toxicology testing, drug discovery, and disease diagnosis are sufficient utilities under
35 U.S.C. §§ 101 and 112, first paragraph 

The claimed invention meets all of the necessary requirements for establishing a credible utility 
under the Patent Law: There are "well-established" uses for the claimed invention known to persons of 
ordinary skill in the art, and there are specific practical and beneficial uses for the invention disclosed in 
the patent application's specification. These uses are explained, in detail, in the Bedilion Declaration 
accompanying this response. Objective evidence, not considered by the Patent Office, further 
corroborates the credibility of the asserted utilities. 

A. The use of the claimed polynucleotides for toxicology testing, drug discovery, 
and disease diagnosis are practical uses that confer "specific benefits" to the 
public 

The claimed invention has specific, substantial, real-world utility by virtue of its use in toxicology
testing, drug development and disease diagnosis through gene expression profiling. These uses are 
explained in detail in the accompanying Bedilion Declaration, the substance of which is not rebutted by
the Office Action. There is no dispute that the claimed invention is in fact a useful tool in cDNA 
microarrays used to perform gene expression analysis. That is sufficient to establish utility for the 
claimed polynucleotides. 

The instant application is a divisional of, and claims priority to, Lai et al. (U.S. Ser. No. 
09/391,958, filed September 8, 1999; hereinafter "the Lai '958 application"), which is a divisional of, 
and claims priority to, Lai et al. (U.S. Ser. No. 08/805,118, filed February 24, 1997; hereinafter "the 
Lai '118 application"). The instant application and the Lai '958 and Lai '118 applications were filed
with essentially identical specifications, with the exception of corrected typographical errors and 
reformatting. Thus page and line numbers may not match as between the instant application and the Lai 
'958 and Lai '118 applications.

In his Declaration, Dr. Bedilion explains the many reasons why a person skilled in the art 
reading the Lai '118 application on February 24, 1997 would have understood that application to
disclose the claimed polynucleotides to be useful for a number of gene expression monitoring 
applications, e.g., as highly specific probes for the expression of those specific polynucleotides in 
connection with the development of drugs and the monitoring of the activity of such drugs. (Bedilion 
Declaration at, e.g., ¶¶ 10-15). Much, but not all, of Dr. Bedilion's explanation concerns the use of the
claimed polynucleotides in cDNA microarrays of the type first developed at Stanford University for
evaluating the efficacy and toxicity of drugs, as well as for other applications. (Bedilion Declaration, ¶¶
12 and 15).¹

In connection with his explanations, Dr. Bedilion states that "the specification of the Lai '118
application would have led a person skilled in the art in February 1997, who was using gene expression
monitoring in connection with developing new drugs for the treatment of disorders associated with
increased or decreased phosphate levels, to conclude that a cDNA microarray that contained the SEQ
ID NO:1-encoding polynucleotides would be a highly useful tool and to request specifically that any
cDNA microarray that was being used for such purposes contain the SEQ ID NO:1-encoding
polynucleotides." (Bedilion Declaration, ¶ 15). For example, as explained by Dr. Bedilion, "[p]ersons
skilled in the art would [on February 24, 1997] appreciate that a cDNA microarray that contained the
SEQ ID NO:1-encoding polynucleotides would be a more useful tool than a cDNA microarray that did
not contain any of these polynucleotides, in connection with conducting gene expression monitoring
studies on proposed (or actual) drugs for disorders associated with increased or decreased phosphate
levels for such purposes as evaluating their efficacy and toxicity." Id.

¹ Dr. Bedilion also explained, for example, why persons skilled in the art would also appreciate,
based on the Lai '118 specification, that the claimed polynucleotides would be useful in connection with
developing new drugs using technology, such as Northern analysis, that predated by many years the
development of the cDNA technology (Bedilion Declaration, ¶ 16).

In support of those statements, Dr. Bedilion provided detailed explanations of how cDNA 
technology can be used to conduct gene expression monitoring evaluations, with extensive citations to 
pre- and post-February 24, 1997 publications showing the state of the art on February 24, 1997. 
(Bedilion Declaration, ¶¶ 10-14). While Dr. Bedilion's explanations in paragraph 15 of his Declaration
include almost three and a half pages of text and six subparts (a)-(f), he specifically states that his
explanations are not "all-inclusive." Id. For example, with respect to toxicity evaluations, Dr. Bedilion
had earlier explained how persons skilled in the art who were working on drug development on
February 24, 1997 (and for several years prior to February 24, 1997) "without any doubt" appreciated
that the toxicity (or lack of toxicity) of any proposed drug was "one of the most important criteria to be
considered and evaluated in connection with the development of the drug" and how the teachings of the
Lai '118 application clearly include using differential gene expression analyses in toxicity studies
(Bedilion Declaration, ¶ 10).

Thus, the Bedilion Declaration establishes that persons skilled in the art reading the Lai '118
application at the time it was filed "would have wanted their cDNA microarray to have a probe to a
SEQ ID NO:1-encoding polynucleotide because a microarray that contained such a probe (as
compared to one that did not) would provide more useful results in the kind of gene expression
monitoring studies using cDNA microarrays that persons skilled in the art have been doing since well
prior to February 24, 1997." (Bedilion Declaration, ¶ 15, item (f)). This, by itself, provides more than
sufficient reason to compel the conclusion that the Lai '118 application disclosed to persons skilled in
the art at the time of its filing substantial, specific and credible real-world utilities for the claimed
polynucleotides. 
Nowhere does the Office Action address the fact that, as described, for example, on pages 20-
21 and 31 of the Lai '118 application, the claimed polynucleotides can be used as highly specific
probes in, for example, cDNA microarrays - probes that without question can be used to measure both 
the existence and amount of complementary RNA sequences known to be the expression products of 
the claimed polynucleotides. The claimed invention is not, in that regard, some random sequence 
whose value as a probe is speculative or would require further research to determine. 

Given the fact that the claimed SEQ ID NO:2 polynucleotide is known to be expressed, its 
utility as a measuring and analyzing instrument for expression levels is as indisputable as a scale's utility 
for measuring weight. This use as a measuring tool, regardless of how the expression level data 
ultimately would be used by a person of ordinary skill in the art, by itself demonstrates that the claimed 
invention provides an identifiable, real-world benefit that meets the utility requirement. Raytheon v. 
Roper, 724 F.2d 951, (Fed. Cir. 1983) (claimed invention need only meet one of its stated objectives 
to be useful); In re Cortright, 165 F.3d 1353, 1359 (Fed. Cir. 1999) (how the invention works is
irrelevant to utility); M.P.E.P. § 2107 ("Many research tools such as gas chromatographs, screening 
assays, and nucleotide sequencing techniques have a clear, specific, and unquestionable utility (e.g., 
they are useful in analyzing compounds)" (emphasis added)).

Though Applicants need not so prove to demonstrate utility, there can be no reasonable dispute 
that persons of ordinary skill in the art have numerous uses for information about relative gene 
expression including, for example, understanding the effects of a potential drug for treating disorders 
associated with increased or decreased phosphate levels. Because the patent application states 
explicitly that the claimed polynucleotide is known to be expressed in brain tumor cells (see the Lai 
'118 application at page 11, lines 20-22; and page 38, lines 25-30), and expresses a protein that is a 
member of a class known to regulate intracellular phosphate levels, there can be no reasonable dispute 
that a person of ordinary skill in the art could put the claimed invention to such use. In other words, the 
person of ordinary skill in the art can derive more information about a potential drug candidate for 
disorders associated with increased or decreased phosphate levels, or potential toxin, with the claimed 
invention than without it (see Bedilion Declaration at, e.g., ¶ 15, subparts (e)-(f)).

The Bedilion Declaration shows that a number of pre-February 24, 1997 publications confirm 
and further establish the utility of cDNA microarrays in a wide range of drug development gene 
expression monitoring applications at the time the Lai '118 application was filed (Bedilion Declaration
¶¶ 10-14; and Tabs A-G). Indeed, Brown and Shalon U.S. Patent No. 5,807,522 (the Brown '522
patent, Bedilion Declaration at Tab D), which issued from a patent application filed in June 1995 and 
was effectively published on December 29, 1995 as a result of the publication of a PCT counterpart 
application, shows that the Patent Office recognizes the patentable utility of the cDNA technology 
developed in the early to mid-1990s. As explained by Dr. Bedilion, among other things (Bedilion 
Declaration, ¶ 12):

The Brown '522 patent further teaches that the "[m]icroarrays of immobilized nucleic 
acid sequences prepared in accordance with the invention" can be used in "numerous" 
genetic applications, including "monitoring of gene expression" applications (see Tab D 
at col. 14, lines 36-42). The Brown '522 patent teaches (a) monitoring gene 
expression (i) in different tissue types, (ii) in different disease states, and (iii) in response 
to different drugs, and (b) that arrays disclosed therein may be used in toxicology 
studies (see Tab D at col. 15, lines 13-18 and 52-58 and col. 18, lines 25-30).

Literature reviews published shortly after the filing of the Lai '118 application describing the 
state of the art further confirm the claimed invention's utility. Rockett et al. confirm, for example, that 
the claimed invention is useful for differential expression analysis regardless of how expression is 
regulated: 

Despite the development of multiple technological advances which have recently 
brought the field of gene expression profiling to the forefront of molecular analysis, 
recognition of the importance of differential gene expression and characterization of 
differentially expressed genes has existed for many years. 

* * * 

Although differential expression technologies are applicable to a broad range of models, 
perhaps their most important advantage is that, in most cases, absolutely no prior 
knowledge of the specific genes which are up- or down-regulated is required. 

* * * 

Whereas it would be informative to know the identity and functionality of all genes 
up/down regulated by . . . toxicants, this would appear a longer term goal .... 
However, the current use of gene profiling yields a pattern of gene changes for a 
xenobiotic of unknown toxicity which may be matched to that of well characterized 
toxins, thus alerting the toxicologist to possible in vivo similarities between the unknown 
and the standard, thereby providing a platform for more extensive toxicological
examination. [emphasis added]

Rockett et al., Differential gene expression in drug metabolism and toxicology: Practicalities, problems
and potential, Xenobiotica 29(7):655 (1999).

In another article, Lashkari et al. state explicitly that sequences that are merely "predicted" to
be expressed (predicted Open Reading Frames, or ORFs) - the claimed invention in fact is known to
be expressed - have numerous uses:

Efforts have been directed toward the amplification of each predicted ORF or any 
other region of the genome ranging from a few base pairs to several kilobase pairs. 
There are many uses for these amplicons- they can be cloned into standard vectors or 
specialized expression vectors, or can be cloned into other specialized vectors such as 
those used for two-hybrid analysis. The amplicons can also be used directly by, 
for example, arraying onto glass for expression analysis, for DNA binding
assays, or for any direct DNA assay. [emphasis added]

Lashkari et al., Whole genome analysis: Experimental access to all genome sequenced segments 
through larger-scale efficient oligonucleotide synthesis and PCR, Proceedings of the National Academy
of Sciences USA 94:8945 (Aug. 1997). 

B. The use of nucleic acids coding for proteins expressed by humans as tools for 
toxicology testing, drug discovery, and the diagnosis of disease is now "well- 
established" 

The technologies made possible by expression profiling and the DNA tools upon which they 
rely are now well-established. The technical literature recognizes not only the prevalence of these 
technologies, but also their unprecedented advantages in drug development, testing and safety 
assessment. These technologies include toxicology testing, as described by Dr. Bedilion in his 
declaration. 

Toxicology testing is now standard practice in the pharmaceutical industry. See, e.g., John C.
Rockett et al., supra:

Knowledge of toxin-dependent regulation in target tissues is not solely an academic
pursuit as much interest has been generated in the pharmaceutical industry to harness
this technology in the early identification of toxic drug candidates, thereby shortening the
developmental process and contributing substantially to the safety assessment of new
drugs.

09/991,212
Docket No.: PF-0221-3 DIV

To the same effect are several other scientific publications, including Emile F. Nuwaysir et al.,
Microarrays and toxicology: The advent of toxicogenomics, Molecular Carcinogenesis 24:153 (1999);
Sandra Steiner and N. Leigh Anderson, Expression profiling in toxicology - potentials and limitations,
Toxicology Letters 112-113:467 (2000).

Nucleic acids useful for measuring the expression of whole classes of genes are routinely
incorporated for use in toxicology testing. Nuwaysir et al. describes, for example, a Human ToxChip
comprising 2089 human clones, which were selected

for their well-documented involvement in basic cellular processes as well as their 
responses to different types of toxic insult. Included on this list are DNA replication 
and repair genes, apoptosis genes, and genes responsive to PAHs and dioxin-like 
compounds, peroxisome proliferators, estrogenic compounds, and oxidant stress. 
Some of the other categories of genes include transcription factors, oncogenes, tumor 
suppressor genes, cyclins, kinases, phosphatases, cell adhesion and motility genes, and 
homeobox genes. Also included in this group are 84 housekeeping genes, whose 
hybridization intensity is averaged and used for signal normalization of the other genes 
on the chip. 

See also Table 1 of Nuwaysir et al. (listing additional classes of genes deemed to be of special interest 
in making a human toxicology microarray). 
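
By way of illustration only, the signal normalization that Nuwaysir et al. describe (averaging the hybridization intensity of the housekeeping genes and using that average to normalize the other genes on the chip) might be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure actually used for the Human ToxChip: the function name, the gene labels, the intensity values, and the choice to divide each signal by the mean housekeeping intensity are all hypothetical.

```python
from statistics import mean


def normalize_by_housekeeping(intensities, housekeeping):
    """Scale each non-housekeeping signal by the mean housekeeping signal.

    intensities: dict mapping a (hypothetical) gene label to its raw
        hybridization intensity.
    housekeeping: set of gene labels treated as housekeeping controls.
    Dividing by the mean is an assumed normalization rule, chosen only
    to illustrate the idea of averaging control signals.
    """
    controls = [intensities[g] for g in housekeeping if g in intensities]
    if not controls:
        raise ValueError("no housekeeping genes found among the intensities")
    scale = mean(controls)
    if scale <= 0:
        raise ValueError("mean housekeeping intensity must be positive")
    return {
        gene: value / scale
        for gene, value in intensities.items()
        if gene not in housekeeping
    }


# Hypothetical raw intensities for two controls and one gene of interest.
raw = {"HK1": 100.0, "HK2": 300.0, "GENE_X": 400.0}
normalized = normalize_by_housekeeping(raw, {"HK1", "HK2"})
```

In this sketch the two hypothetical controls average to 200.0, so the gene of interest is reported relative to that baseline rather than as a raw intensity.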

The more genes that are available for use in toxicology testing, the more powerful the technique. 
"Arrays are at their most powerful when they contain the entire genome of the species they are being 
used to study." John C. Rockett and David J. Dix, Application of DNA arrays to toxicology,
Environmental Health Perspectives 107(8):681 (1999). Control genes are carefully selected for their 
stability across a large set of array experiments in order to best study the effect of toxicological 
compounds. See attached email from the primary investigator of the Nuwaysir paper, Dr. Cynthia 
Afshari to an Incyte employee, dated July 3, 2000, as well as the original message to which she was 
responding. Thus, there is no expressed gene which is irrelevant to screening for toxicological effects, 
and all expressed genes have a utility for toxicological screening. 

In fact, the potential benefits to the public, in terms of lives saved and reduced health care costs,
are enormous. Recent developments provide evidence that the benefits of this information are already 
beginning to manifest themselves. Examples include the following: 
• In 1999, CV Therapeutics, an Incyte collaborator, was able to use Incyte gene
expression technology, information about the structure of a known transporter 
gene, and chromosomal mapping location, to identify the key gene associated 
with Tangier disease. This discovery took place over a matter of only a few
weeks, due to the power of these new genomics technologies. The discovery 
received an award from the American Heart Association as one of the top 10 
discoveries associated with heart disease research in 1999. 

• In an April 9, 2000, article published by the Bloomberg news service, an Incyte 
customer stated that it had reduced the time associated with target discovery 
and validation from 36 months to 18 months, through use of Incyte's genomic 
information database. Other Incyte customers have privately reported similar 
experiences. The implications of this significant saving of time and expense for 
the number of drugs that may be developed and their cost are obvious. 

• In a February 10, 2000, article in the Wall Street Journal, one Incyte 
customer stated that over 50 percent of the drug targets in its current pipeline 
were derived from the Incyte database. Other Incyte customers have privately 
reported similar experiences. By doubling the number of targets available to 
pharmaceutical researchers, Incyte genomic information has demonstrably 
accelerated the development of new drugs. 

Because the Office Action failed to address or consider the "well-established" utilities for the 
claimed invention in toxicology testing, drug development, and the diagnosis of disease, the rejections 
should be withdrawn regardless of their merit. 

C. The similarity of the polypeptide encoded by the claimed invention to another
polypeptide of undisputed utility demonstrates utility 

In addition to having substantial, specific and credible utilities in numerous gene expression
monitoring applications, the utility of the claimed polynucleotides can be imputed based on the
relationship between the polypeptide they encode, NAPTR, and another polypeptide of unquestioned
utility, human renal sodium phosphate transport protein (NPT1). The two polypeptides have sufficient
similarities in their sequences that a person of ordinary skill in the art would recognize more than a
reasonable probability that the polypeptide encoded for by the claimed invention has utility similar to
NPT1. Applicants need not show any more to demonstrate utility. In re Brana, 51 F.3d at 1567.
It is undisputed that the polypeptide coded for by the claimed polynucleotides shares 48%
sequence identity over 401 amino acid residues with NPT1 (Specification, e.g., page 10, line 32 to 
page 11, line 1; and Figures 2A and 2B). In addition, NAPTR, NPT1, and rat brain-specific sodium- 
dependent inorganic phosphate cotransporter all share a potential N-glycosylation site (e.g., at page 11, 
lines 1-2), and have rather similar hydrophobicity plots (e.g., at page 11, lines 2-4; and Figures 3A, 3B, 
and 3C). This is more than enough homology to demonstrate a reasonable probability that the utility of 
NPT1 can be imputed to the polynucleotides of the claimed invention (through the polypeptide they 
encode). It is well-known that the probability that two unrelated polypeptides share more than 40% 
sequence homology over 70 amino acid residues is exceedingly small. Brenner et al., Proceedings of 
the National Academy of Sciences USA 95:6073-6078 (1998). Given homology in excess of 40% 
over more than 70 amino acid residues, the probability that the polypeptide coded for by the claimed 
polynucleotides is related to NPT1 is, accordingly, very high. 

The Patent Office must accept the Applicants' demonstration that the homology between the 
polypeptide coded for by the claimed invention and NPT1 demonstrates utility by a reasonable 
probability unless the Patent Office can demonstrate through evidence or sound scientific reasoning that 
a person of ordinary skill in the art would doubt utility. See In re Langer, 503 F.2d 1380, 1391-92, 
183 USPQ 288 (CCPA 1974). The Patent Office has not provided sufficient evidence or sound 
scientific reasoning to the contrary. 

While the Patent Office has cited literature identifying some of the difficulties that may be 
involved in predicting protein function, none suggests that functional homology cannot be inferred by a 
reasonable probability in this case. van de Loo et al., Proceedings of the National Academy of 
Sciences USA 92:6743-6747 (1995); Seffernick et al., Journal of Bacteriology 183:2405-2410 
(2001); Bork, Genome Research 10:398-400 (2000). Importantly, none contradicts Brenner's basic 
rule that sequence homology in excess of 40% over 70 or more amino acid residues yields a high 
probability of functional homology as well. Brenner et al., Proceedings of the National Academy of 
Sciences USA 95:6073-6078 (1998). At most, these articles individually and together stand for the 
proposition that it is difficult to make predictions about function with certainty. The standard applicable 
in this case is not, however, proof to certainty, but rather proof to reasonable probability. 

D. Objective evidence corroborates the utilities of the claimed invention 

There is in fact no restriction on the kinds of evidence a Patent Examiner may consider in 
determining whether a "real-world" utility exists. "Real-world" evidence, such as evidence showing 
actual use or commercial success of the invention, can demonstrate conclusive proof of utility. 
Raytheon v. Roper, 220 USPQ 592 (Fed. Cir. 1983); Nestle v. Eugene, 55 F.2d 854, 856, 12 
USPQ 335 (6th Cir. 1932). Indeed, proof that the invention is made, used or sold by any person or 
entity other than the patentee is conclusive proof of utility. United States Steel Corp. v. Phillips 
Petroleum Co., 865 F.2d 1247, 1252, 9 USPQ2d 1461 (Fed. Cir. 1989). 

Over the past several years, a vibrant market has developed for databases containing all 
expressed genes (along with the polypeptide translations of those genes), in particular genes having 
medical and pharmaceutical significance such as the instant sequence. (Note that the value in these 
databases is enhanced by their completeness, but each sequence in them is independently valuable.) 
The databases sold by Applicants' assignee, Incyte, include exactly the kinds of information made 
possible by the claimed invention, such as tissue and disease associations. Incyte sells its database 
containing the claimed sequence and millions of other sequences, throughout the scientific community, 
including to pharmaceutical companies who use the information to develop new pharmaceuticals. 

Both Incyte's customers and the scientific community have acknowledged that Incyte's 
databases have proven to be valuable in, for example, the identification and development of drug 
candidates. As Incyte adds information to its databases, including the information that can be generated 
only as a result of Incyte's invention of the polypeptide encoded by the claimed polynucleotides, the 
databases become even more powerful tools. Thus the claimed invention adds more than incremental 
benefit to the drug discovery and development process. 

Customers can, moreover, purchase the claimed SEQ ID NO: 2 polynucleotide directly from 
Incyte, saving the customer the time and expense of isolating and purifying or cloning the polynucleotide 
for research uses such as those described supra. 

III. The Office Action's Rejections Are Without Merit 

Rather than responding to the evidence demonstrating utility, the Office Action attempts to 
dismiss it altogether by arguing that the disclosed and well-established utilities for the claimed 
polynucleotides are not "specific and substantial asserted" utilities (Office Action, September 20, 2002; 
page 3, ¶ 4). The Office Action is incorrect both as a matter of law and as a matter of fact. 

A. The precise biological role or function of an expressed polynucleotide is not 
required to demonstrate utility 

The Office Action's primary rejection of the claimed invention is based on the ground that, 
without information as to the precise "biological role" of the claimed invention, the claimed invention's 
utility is not sufficiently specific. According to the Office Action, it is not enough that a person of 
ordinary skill in the art could use and, in fact, would want to use the claimed invention either by itself or 
in a cDNA microarray to monitor the expression of genes for such applications as the evaluation of a 
drug's efficacy and toxicity. The Office Action would require, in addition, that the Applicants provide a 
specific and substantial interpretation of the results generated in any given expression analysis. 

It may be that specific and substantial interpretations and detailed information on biological 
function are necessary to satisfy the requirements for publication in some technical journals, but they are 
not necessary to satisfy the requirements for obtaining a United States patent. The relevant question is 
not, as the Office Action would have it, whether it is known how or why the invention works, In re 
Cortright, 165 F.3d 1353, 1359 (Fed. Cir. 1999), but rather whether the invention provides an 
"identifiable benefit" in presently available form. Juicy Whip Inc. v. Orange Bang Inc., 185 F.3d 
1364, 1366 (Fed. Cir. 1999). If the benefit exists, and there is a substantial likelihood the invention 
provides the benefit, it is useful. There can be no doubt, particularly in view of the Bedilion Declaration 
(at, e.g., ¶¶ 10 and 15), that the present invention meets this test. 

The threshold for determining whether an invention produces an identifiable benefit is low. 
Juicy Whip, 185 F.3d at 1366. Only those utilities that are so nebulous that a person of ordinary skill 
in the art would not know how to achieve an identifiable benefit and, at least according to the PTO 
guidelines, so-called "throwaway" utilities that are not directed to a person of ordinary skill in the art at 
all, do not meet the statutory requirement of utility. Utility Examination Guidelines, 66 Fed. Reg. 1092 
(Jan. 5, 2001). 

Knowledge of the biological function or role of a biological molecule has never been required to 
show real-world benefit. In its most recent explanation of its own utility guidelines, the PTO 
acknowledged as much (66 F.R. at 1095): 

[T]he utility of a claimed DNA does not necessarily depend on the function of the 
encoded gene product. A claimed DNA may have specific and substantial utility 
because, e.g., it hybridizes near a disease-associated gene or it has gene-regulating 
activity. 

By implicitly requiring knowledge of biological function for any claimed nucleic acid, the Office 
Action has, contrary to law, elevated what is at most an evidentiary factor into an absolute requirement 
of utility. Rather than looking to the biological role or function of the claimed invention, the Office 
should have looked first to the benefits it is alleged to provide. 

B. Membership in a class of useful products can be proof of utility 

Despite the evidence that the claimed polynucleotides encode a polypeptide in the phosphate 
transporter family, the Office Action refused to impute the utility of the members of the phosphate 
transporter family to NAPTR. The Office Action of September 20, 2002 takes the position that, unless 
Applicants can identify which particular biological function within the class of phosphate transporters is 
possessed by NAPTR, utility cannot be imputed. To demonstrate utility by membership in the class of 
phosphate transporters, the Office would require that all phosphate transporters possess a "common" 
utility. 

There is no such requirement in the law. In order to demonstrate utility by membership in a 
class, the law requires only that the class not contain a substantial number of useless members. So long 
as the class does not contain a substantial number of useless members, there is sufficient likelihood that 
the claimed invention will have utility, and a rejection under 35 U.S.C. § 101 is improper. That is true 
regardless of how the claimed invention ultimately is used and whether or not the members of the class 
possess one utility or many. See Brenner v. Manson, 383 U.S. 519, 532 (1966); Application of 
Kirk, 376 F.2d 936, 943 (CCPA 1967). 

Membership in a "general" class is insufficient to demonstrate utility only if the class contains a 
sufficient number of useless members such that a person of ordinary skill in the art could not impute 
utility by a substantial likelihood. There would be, in that case, a substantial likelihood that the claimed 
invention is one of the useless members of the class. In the few cases in which class membership did 
not prove utility by substantial likelihood, the classes did in fact include predominately useless members. 
E.g., Brenner (man-made steroids); Kirk (same); Natta (man-made polyethylene polymers). 

The Office Action addresses NAPTR as if the general class in which it is included is not the 
phosphate transporter family, but rather all polynucleotides or all polypeptides, including the vast 
majority of useless theoretical molecules not occurring in nature, and thus not pre-selected by nature to 
be useful. While these "general classes" may contain a substantial number of useless members, the 
phosphate transporter family does not. The phosphate transporter family is sufficiently specific to rule 
out any reasonable possibility that NAPTR would not also be useful like the other members of the 
family. 

Because the Office Action has not presented any evidence that the class of phosphate 
transporters has any, let alone a substantial number, of useless members, the Office Action must 
conclude that there is a "substantial likelihood" that the NAPTR encoded by the claimed 
polynucleotides is useful. It follows that the SEQ ID NO:2 polynucleotide also is useful. 

Even if the Office Action's "common utility" criterion were correct - and it is not - the 
phosphate transporter family would meet it. It is undisputed that known members of the phosphate 
transporter family are proteins involved in the regulation of intracellular phosphate levels. A person of 
ordinary skill in the art need not know any more about how the claimed invention participates in the 
regulation of intracellular phosphate levels to use it, and the Office Action presents no evidence to the 
contrary. Instead, the Office Action makes the conclusory observation that a person of ordinary skill in 
the art would need to know whether, for example, any given phosphate transporter carries out a 
particular role in the regulation of intracellular phosphate levels. The Office Action then goes on to 
assume that the only use for NAPTR absent knowledge as to how the phosphate transporter actually 
works is further study of NAPTR itself. 

Not so. As demonstrated by Applicants, knowledge that NAPTR is a phosphate transporter is 
more than sufficient to make it useful for the diagnosis and treatment of disorders associated with 
increased or decreased phosphate levels. Indeed, NAPTR has been shown to be expressed in human 
brain tumor tissues. The Patent Office must accept these facts to be true unless the Office can provide 
evidence or sound scientific reasoning to the contrary. But the Patent Office has not done so. 

C. The uses of the claimed polynucleotides in toxicology testing, drug discovery, 
and disease diagnosis are practical uses beyond mere study of the invention 
itself 

The Office Action's rejection of the claims at issue is tantamount to a rejection on the ground 
that the use of an invention as a tool for research is not a "substantial" use. Because the Office Action's 
rejection assumes a substantial overstatement of the law, and is incorrect in fact, it must be withdrawn. 

There is no authority for the proposition that use as a tool for research is not a substantial utility. 
Indeed, the Patent Office itself has recognized that just because an invention is used in a research setting 
does not mean that it lacks utility (M.P.E.P. § 2107): 

Many research tools such as gas chromatographs, screening assays, and nucleotide 
sequencing techniques have a clear, specific and unquestionable utility (e.g., they are 
useful in analyzing compounds). An assessment that focuses on whether an invention is 
useful only in a research setting thus does not address whether the specific invention is 
in fact "useful" in a patent sense. Instead, Office personnel must distinguish between 
inventions that have a specifically identified substantial utility and inventions whose 
asserted utility requires further research to identify or reasonably confirm. 

The Patent Office's actual practice has been, at least until the present, consistent with that approach. It 
has routinely issued patents for inventions whose only use is to facilitate research, such as DNA ligases. 
These are acknowledged by the Patent Office's Training Materials to be useful, as are polynucleotide 
sequences used, for example, as markers. 

The subset of research uses that are not "substantial" utilities is limited. It consists only of those 
uses in which the claimed invention is to be an object of further study, thus merely inviting further 
research on the invention itself. This follows from Brenner, in which the U.S. Supreme Court held that 
a process for making a compound does not confer a substantial benefit where the only known use of 
the compound was to be the object of further research to determine its use. Id. at 535. Similarly, in 
Kirk, the Court held that a compound would not confer substantial benefit on the public merely 
because it might be used to synthesize some other, unknown compound that would confer substantial 
benefit. Kirk, 376 F.2d at 940, 945. ("What appellants are really saying to those in the art is take 
these steroids, experiment, and find what use they do have as medicines.") Nowhere do those cases 
state or imply, however, that a material cannot be patentable if it has some other, additional beneficial 
use in research. 

As used in toxicology testing, drug discovery, and disease diagnosis, the claimed invention has a 
beneficial use in research other than studying the claimed invention or its protein products. It is a tool, 
rather than an object, of research. The data generated in gene expression monitoring using the claimed 
invention as a tool is not used merely to study the claimed polynucleotide itself, but rather to study 
properties of tissues, cells, and potential drug candidates and toxins. Without the claimed invention, the 
information regarding the properties of tissues, cells, drug candidates and toxins is less complete. 
(Bedilion Declaration at ¶ 15.) 

The use of the claimed invention as a research tool in toxicology testing is specific and 
substantial. While it is true that all polynucleotides expressed in humans have utility in toxicology testing 
based on the property of being expressed at some time in development or in the cell life cycle, this basis 
for utility does not preclude that utility from being specific and substantial. A toxicology test using any 
particular expressed polynucleotide is dependent on the identity of that polynucleotide, not on its 
biological function or its disease association. The results obtained from using any particular 
human-expressed polynucleotide in toxicology testing are specific to both the compound being tested and the 
polynucleotide used in the test. No two human-expressed polynucleotides are interchangeable 
for toxicology testing because the effects on the expression of any two such polynucleotides will differ 
depending on the identity of the compound tested and the identities of the two polynucleotides. It is 
not necessary to know the biological functions and disease associations of the polynucleotides in order 
to carry out such toxicology tests. Therefore, at the very least, the claimed polynucleotides are specific 
controls for toxicology tests in developing drugs targeted to other polynucleotides, and are clearly useful 
as such. 

As an example, any histone gene expressed in humans can be used in a specific and substantial 
toxicology test in drug development. A histone gene may not be suitable as a target for drug 
development because disruption of such a gene may kill a patient. However, a human-expressed 
histone gene is surely an excellent subject for toxicology studies when developing drugs targeted to 
other genes. A drug candidate which alters expression of a histone gene is toxic because disruption of 
such a pervasively-expressed gene would have undesirable side effects in a patient. Therefore, when 
testing the toxicology of a drug candidate targeted to another gene, measuring the expression of a 
histone gene is a good measure of the toxicity of that candidate, particularly in in vitro cellular assays at 
an early stage of drug development. The utility of any particular human-expressed histone gene in 
toxicology testing is specific and substantial because a toxicology test using that histone gene cannot be 
replaced by a toxicology test using a different gene, including any other histone gene. This specific and 
substantial utility requires no knowledge of the biological function or disease association of the histone 
gene. 

The claimed invention has numerous additional uses as a research tool, each of which alone is a 
"substantial utility." These include diagnostic assays (Specification, e.g., at pages 32-33), chromosomal 
mapping (e.g., at pages 33-34), etc. 

D. The Office Action failed to demonstrate that a person of ordinary skill in the art 
would reasonably doubt the utility of the claimed invention 

Based principally on citations to scientific literature identifying some of the difficulties involved in 
predicting protein function, the Office Action rejected the pending claims on the ground that the 
Applicants cannot impute utility to the claimed invention based on the 48% identity over 401 amino acid 
residues between the encoded polypeptide, NAPTR, and another polypeptide undisputed by the Office 
Action to be useful. The Office Action's rejection is incorrect both as a matter of fact and as a matter 
of procedural law. 

As demonstrated in § II.C, supra, the literature cited in the Office Action is not inconsistent 
with the Applicants' proof of homology by a reasonable probability. It may show that Applicants 
cannot prove function by homology with certainty, but Applicants need not meet such a rigorous 
standard of proof. Under the applicable law, once the Applicants demonstrate a prima facie case of 
homology, the Office must accept the assertion of utility to be true unless the Office comes forward with 
evidence showing a person of ordinary skill would doubt the asserted utility could be achieved by a 
reasonable probability. See In re Brana, 51 F.3d at 1566; In re Langer, 503 F.2d 1380, 1391-92, 
183 USPQ 288 (CCPA 1974). The Office has not made such a showing and, as such, the Office 
Action's rejection should be withdrawn. 

In the present case, the Office Action contended that the degree of amino acid identity among 
NAPTR and other phosphate transporter proteins is insufficient to establish that NAPTR is a member 
of the phosphate transporter family and thus shares the same utilities. The Office attempted to support 
this assertion with the teachings of van de Loo et al. (Proc. Natl. Acad. Sci. USA (1995) 
92:6743-6747), Seffernick et al. (J. Bacteriol. (2001) 183:2405-2410), and Bork (Genome Res. (2000) 
10:398-400), all of record and addressed below. However, all of these references fail to support the 
outstanding rejections. 

In support of Applicants' use of amino acid sequence homology to reasonably predict the utility 
of the polypeptide encoded by the claimed polynucleotides, Applicants provide the enclosed reference 
by Brenner et al. ("Assessing sequence comparison methods with reliable structurally identified distant 
evolutionary relationships," Proc. Natl. Acad. Sci. USA (1998) 95:6073-6078). Through exhaustive 
analysis of a dataset of proteins with known structural and functional relationships and with <90% 
overall sequence identity, Brenner et al. have determined that 40% identity is a reliable threshold for 
establishing evolutionary homology between two sequences aligned over at least 70 residues, and that 
30% identity is a reliable threshold between two sequences aligned over at least 150 residues (Brenner 
et al., page 6076). Therefore, the 48% sequence identity between SEQ ID NO:1 and the human renal 
sodium phosphate transport protein NPT1, over 401 amino acid residues, exceeds the threshold 
proposed by Brenner et al., and SEQ ID NO:1 is a true phosphate transporter protein by these criteria. 
Since these criteria are based on a dataset of homologous proteins with shared structural and functional 
features, one of ordinary skill in the art would likewise expect SEQ ID NO:1 to possess the 
evolutionarily conserved structural and functional characteristics of the NPT1 protein. Hence the 
"reasonable correlation" standard as set by case law has been met. 

Contrary to the assertions of the Office Action, the use of such sequence comparisons to 
predict protein function is supported by the Bork reference, cited by the Office Action. The Bork 
reference discloses a 70% accuracy rate in bioinformatics-based predictions. This more than meets 
the legal standard of utility, which requires only that one of skill in the art would more likely than not 
believe the utility of the claimed invention. For predicting functional features by homology, Table 1 of 
Bork discloses a 90% accuracy rate, even greater than the 70% rate for all bioinformatics predictions. 

The Office Action cited van de Loo et al. and Seffernick et al. as evidence that "homologous 
proteins having significant sequence homology may exhibit different functions" (Office Action, 
September 20, 2002; page 4). van de Loo et al. describe the cloning of a fatty acyl hydroxylase based 
on sequence homology between certain fatty acyl hydroxylases and fatty acyl desaturases. In this 
example, the authors characterize fatty acyl hydroxylases and fatty acyl desaturases as catalyzing similar 
reactions (e.g., van de Loo et al., page 6743, right column, last paragraph), and conclude that the 
reaction mechanisms of oleate 12-hydroxylase and oleate desaturase are similar based on the sequence 
homology between them (e.g., van de Loo et al., abstract). In addition, Broun et al. (Science (1998) 
282:1315-1317; cited in the Office Action) characterize oleate desaturases and oleate hydroxylases as 
being "members of a large family of functionally diverse enzymes" (at page 1315, first column). Since 
the functions of the proteins described by van de Loo et al. are similar, and since these proteins belong 
to the same family, it is not surprising that they share 67% sequence homology. In fact, this 67% 
sequence homology is an accurate indicator that these two proteins belong to the same family, further 
supporting the use of sequence homology to predict protein function. 

Similarly, Seffernick et al. describe a melamine deaminase and an atrazine chlorohydrolase that 
share 98% sequence identity and yet have different substrate specificities. These two enzymes both 
belong to the amidohydrolase enzyme superfamily whose members catalyze the hydrolytic displacement 
of amino groups or chlorine substituents from triazine ring compounds (e.g., Seffernick et al., page 
2409, right column, second paragraph). Notably, there is at least one member of the amidohydrolase 
superfamily that catalyzes both deamination and dechlorination reactions with triazine ring substrates 
(Id.). Therefore, the 98% sequence homology between melamine deaminase and atrazine 
chlorohydrolase correctly predicts their functional similarity and their membership in a common enzyme 
family. 

These examples in which it is difficult to obtain a precise functional prediction do not contradict 
the findings of Bork that, in the majority of cases, protein function is accurately predicted by sequence 
homology methods. In each of these examples, sequence homology methods correctly assign proteins 
to particular enzyme families whose members share similar enzyme activities. Thus, van de Loo et al. 
and Seffernick et al. do not provide any evidence that one of skill in the art would more likely than 
not doubt that NAPTR possesses the utilities of the NPT1 phosphate transporter. 

Seffernick et al. recognize that "functional assignments based on >50% sequence identity are 
considered to be reasonably sound" (Seffernick et al., page 2409, left column, paragraph 2). These 
authors state that their finding that "proteins with >98% sequence identity catalyze different reactions in 
different metabolic pathways is highly exceptional" (Id.; emphasis added). This supports the fact that 
while there may be a number of examples in which the assignment of function by sequence homology is 
not perfectly accurate, this does not contradict the findings of Bork that, in general, sequence 
homology is an accurate method for assigning biological function. 

In further support of the rejection, the Office Action cites Bork as evidence that "predicting the 
function of a polypeptide encoded by a specific gene, by sequence database searches has a 
considerable error rate" (Office Action, September 20, 2002; page 4). However, this does not negate 
the fact that there is a 90% accuracy rate for the prediction of functional features by homology, as 
disclosed by Bork. At most the Office Action shows that errors can occur in functional assignment. 
The Bork reference does not show that errors do not occur, but it does quantify the error rate at about 
10%. The references cited in the Office Action show that there may be difficulties and errors involved 
in predicting protein function by homology. However, these references do not contradict the fact that 
such methods are accurate more often than not. As such, one of skill in the art would more likely 
than not believe that NAPTR has the utilities of the family of phosphate transporter proteins. 

As the cited evidence is completely insufficient to support the rejections of the claims, the 
outstanding rejections must be withdrawn for this reason alone. The only relevant evidence of record 
shows that a person of ordinary skill in the art would not doubt that the polypeptide encoded by the 
claimed polynucleotides is in fact a member of the family of phosphate transporter proteins, which are 
known to have specific utility. 

IV. By Requiring the Patent Applicant to Assert a Particular or Unique Utility, the Patent 
Examination Utility Guidelines and Training Materials Applied by the Patent 
Examiner Misstate the Law 

There is an additional, independent reason to withdraw the rejections: to the extent the 
rejections are based on Revised Interim Utility Examination Guidelines (64 FR 71427, December 21, 
1999), the final Utility Examination Guidelines (66 FR 1092, January 5, 2001) and/or the Revised 
Interim Utility Guidelines Training Materials (USPTO Website www.uspto.gov, March 1, 2000), the 
Guidelines and Training Materials are themselves inconsistent with the law. 

The Training Materials, which direct the Examiners regarding how to apply the Utility 
Guidelines, address the issue of specificity with reference to two kinds of asserted utilities: "specific" 
utilities, which meet the statutory requirements, and "general" utilities, which do not. The Training 
Materials define a "specific utility" as follows: 

A [specific utility] is specific to the subject matter claimed. This contrasts to general 
utility that would be applicable to the broad class of invention. For example, a claim to 
a polynucleotide whose use is disclosed simply as "gene probe" or "chromosome 
marker" would not be considered to be specific in the absence of a disclosure of a 
specific DNA target. Similarly, a general statement of diagnostic utility, such as 
diagnosing an unspecified disease, would ordinarily be insufficient absent a disclosure of 
what condition can be diagnosed. 

The Training Materials distinguish between "specific" and "general" utilities by assessing 
whether the asserted utility is sufficiently "particular," i.e., unique (Training Materials at page 52) as 
compared to the "broad class of invention." (In this regard, the Training Materials appear to parallel 
the view set forth in Stephen G. Kunin, Written Description Guidelines and Utility Guidelines, 82 
J.P.T.O.S. 77, 97 (Feb. 2000) ("With regard to the issue of specific utility the question to ask is 
whether or not a utility set forth in the specification is particular to the claimed invention.").) 

Such "unique" or "particular" utilities never have been required by the law. To meet the utility 
requirement, the invention need only be "practically useful," Natta, 480 F.2d at 1397, and confer a 
"specific benefit" on the public. Brenner, 383 U.S. at 534. Thus incredible "throwaway" utilities, such 
as trying to "patent a transgenic mouse by saying it makes great snake food," do not meet this standard. 
Karen Hall, Genomic Warfare, The American Lawyer 68 (June 2000) (quoting John Doll, Chief of the 
Biotech Section of USPTO). 

This does not preclude, however, a general utility, contrary to the statement in the Training 
Materials where "specific utility" is defined (page 5). Practical real-world uses are not limited to uses 
that are unique to an invention. The law requires that the practical utility be "definite," not particular. 
Montedison, 664 F.2d at 375. Applicants are not aware of any court that has rejected an assertion of 
utility on the grounds that it is not "particular" or "unique" to the specific invention. Where courts have 
found utility to be too "general," it has been in those cases in which the asserted utility in the patent 
disclosure was not a practical use that conferred a specific benefit. That is, a person of ordinary skill in 
the art would have been left to guess as to how to benefit at all from the invention. In Kirk, for 
example, the CCPA held the assertion that a man-made steroid had "useful biological activity" was 
insufficient where there was no information in the specification as to how that biological activity could be 
practically used. Kirk, 376 F.2d at 941. 

The fact that an invention can have a particular use does not provide a basis for requiring a 
particular use. See Brana, supra (disclosure describing a claimed antitumor compound as being 
homologous to an antitumor compound having activity against a "particular" type of cancer was
determined to satisfy the specificity requirement). "Particularity" is not and never has been the sine qua 
non of utility; it is, at most, one of many factors to be considered. 

As described supra, broad classes of inventions can satisfy the utility requirement so long as a 
person of ordinary skill in the art would understand how to achieve a practical benefit from knowledge 
of the class. Only classes that encompass a significant portion of nonuseful members would fail to meet 
the utility requirement. Supra § II.B (Montedison, 664 F.2d at 374-375).

The Training Materials fail to distinguish between broad classes that convey information of 
practical utility and those that do not, lumping all of them into the latter, unpatentable category of 
"general" utilities. As a result, the Training Materials paint with too broad a brush. Rigorously applied, 
they would render unpatentable whole categories of inventions heretofore considered to be patentable, 
and that have indisputably benefitted the public, including the claimed invention. See supra § II.B.
Thus the Training Materials cannot be applied consistently with the law. 

VI. Rejections under 35 U.S.C. § 112, second paragraph 

Claim 10 was rejected under 35 U.S.C. § 112, second paragraph, based on the allegation that 
the recitation of "[a] method of claim 9" is indefinite. This rejection is traversed. 

To expedite prosecution, claim 10 has been amended to recite "[t]he method of claim 9" as 
suggested by the Examiner. Applicants do not concede to the Patent Office position; Applicants are 
amending the claims solely to obtain expeditious allowance of the instant application. 

While not conceding to the Patent Office position, it is believed that claim 10, as amended, 
recites patentable subject matter. Therefore, withdrawal of this rejection is requested. 

Claims 12, 13, 46, 48, 57, and 58 were rejected under 35 U.S.C. § 112, second paragraph, 
based on the allegation that the recitation of "complementary" is indefinite. This rejection is traversed. 

To expedite prosecution, claims 12, 13, 57, and 58 have been amended to recite "completely
complementary," as suggested by the Examiner. By these amendments, Applicants expressly do not
disclaim equivalents of the invention which could include polynucleotides less than completely
complementary to the recited polynucleotides. Applicants do not concede to the Patent Office position;
Applicants are amending the claims solely to obtain expeditious allowance of the instant application.

While not conceding to the Patent Office position, it is believed that claims 12, 13, 57, and 58, 
as amended, and dependent claims 46 and 48, recite patentable subject matter. Therefore, withdrawal 
of this rejection is requested. 

Claim 48 was rejected under 35 U.S.C. § 112, second paragraph, based on the allegation that 
the recitation of "nucleotide molecules" is indefinite. This rejection is traversed. 

To expedite prosecution, claim 48 has been amended to recite "nucleic acid molecules" as 
suggested by the Examiner. Applicants do not concede to the Patent Office position; Applicants are 
amending the claims solely to obtain expeditious allowance of the instant application. 

While not conceding to the Patent Office position, it is believed that claim 48, as amended, 
recites patentable subject matter. Therefore, withdrawal of this rejection is requested. 

Claim 48 was further rejected under 35 U.S.C. § 112, second paragraph, based on the
allegation that the recitation of "specifically hybridizable" is indefinite. The Office Action asserts that "it
is unclear as to how complementary a polynucleotide must be to be 'specifically hybridizable with at
least 30 contiguous nucleotides of a target polynucleotide'" (Office Action, September 20, 2002; page
6, ¶ 9). This rejection is traversed.

Under the second paragraph of 35 U.S.C. § 112, the standard for "definiteness" is that the 
claims define patentable subject matter with a reasonable degree of precision and particularity. See In 
re Miller, 169 USPQ 597, 599 (CCPA 1971); In re Moore, 169 USPQ 236, 238 (CCPA 1971). 
See also M.P.E.P. § 706.03(d). In this regard, the Supreme Court has indicated that the primary 
purpose of claim language is to give "fair" notice of what would constitute the infringement of a claim. 
See United Carbon Co. v. Binney & Smith Co., 317 U.S. 228, 55 USPQ 381 (1942). In other
words, the basic purpose of 35 U.S.C. § 112, second paragraph is to require a claim to reasonably
apprise those skilled in the art of the scope of the invention defined by that claim and give fair notice of
what constitutes infringement of the claim. See Antonius v. Pro Group Inc., 217 USPQ 875, 877
(6th Cir. 1983). The present claims meet the legal standards required by 35 U.S.C. § 112, second
paragraph. 

The term "hybridization" is defined in the specification at, for example, page 6, lines 30-31. 
"Specific" hybridization, or binding, is discussed in the specification at, for example, page 7, lines 23- 
30. A nucleic acid molecule which is "specifically hybridizable" to a target polynucleotide hybridizes 
"specifically" or "selectively" to that target polynucleotide. The specification provides further support 
for specifically hybridizable probes at, for example, page 31, lines 6-26 and page 40, lines 6-24. 

The degree of complementarity necessary for a polynucleotide to be "specifically hybridizable" 
to a target polynucleotide could be ascertained by one of skill in the art by considering the phrase 
"specifically hybridizable" in the context of claim 48. This claim recites an array which can be used to 
detect a target polynucleotide, wherein the detection of such a target polynucleotide relies upon the 
formation of a specific hybridization complex between a probe polynucleotide and the target 
polynucleotide. One of skill in the art would understand that the hybridization of the probe and target 
polynucleotides would require a certain degree of specificity in order for the claimed array to function 
effectively in the detection of the target polynucleotide. Furthermore, one of skill in the art would 
reasonably conclude that the degree of complementarity is that which is necessary to achieve the 
requisite specificity of hybridization for operability of the claimed array. Therefore, a person of skill in 
the art would reasonably understand the metes and bounds of the phrase "specifically hybridizable" in 
the context of the recited array.

For at least the above reasons, withdrawal of this rejection under 35 U.S.C. § 112, second
paragraph, is requested.

VII. Written description rejection under 35 U.S.C. § 112, first paragraph

Claims 3, 6, 7, 9, 12, 13, 46, 48, 57, and 58 were rejected under 35 U.S.C. § 112, first
paragraph, as being based on a specification which allegedly fails to reasonably convey to one of skill in 
the art that the Applicants had possession of the claimed invention at the time the application was filed. 
The Office Action asserts that "[t]he specification discloses only a single species of the claimed genus, 
i.e., SEQ ID NO:2 encoding SEQ ID NO:1, which is insufficient to put one of skill in the art in
possession of the attributes and features of all species within the claimed genus" (Office Action,
September 20, 2002; page 7). This rejection is traversed.

The requirements necessary to fulfill the written description requirement of 35 U.S.C. § 112,
first paragraph, are well established by case law.

... the applicant must also convey with reasonable clarity to those skilled in the art that, 
as of the filing date sought, he or she was in possession of the invention. The invention 
is, for purposes of the "written description" inquiry, whatever is now claimed. 
Vas-Cath, Inc. v. Mahurkar, 19 USPQ2d 1111, 1117 (Fed. Cir. 1991) 

Attention is also drawn to the Patent and Trademark Office's own "Guidelines for Examination 
of Patent Applications Under the 35 U.S.C. Sec. 112, para. 1", published January 5, 2001, which 
provide that: 

An applicant may also show that an invention is complete by disclosure of sufficiently
detailed, relevant identifying characteristics which provide evidence that applicant was
in possession of the claimed invention, i.e., complete or partial structure, other
physical and/or chemical properties, functional characteristics when coupled with a
known or disclosed correlation between function and structure, or some combination of
such characteristics. What is conventional or well known to one of ordinary skill in the
art need not be disclosed in detail. If a skilled artisan would have understood the
inventor to be in possession of the claimed invention at the time of filing, even if every
nuance of the claims is not explicitly described in the specification, then the adequate
description requirement is met.

Thus, the written description standard is fulfilled by both what is specifically disclosed and what 
is conventional or well known to one skilled in the art. 



A. The specification provides an adequate written description of the claimed "variants" and
"fragments" of SEQ ID NO:1 and SEQ ID NO:2.

The subject matter encompassed by claims 3, 6, 7, 9, 12, 13, 46, 48, 57, and 58 is either 
disclosed by the specification or is conventional or well known to one skilled in the art. 

First note that the "variant" language of independent claim 3 recites a polynucleotide encoding a
polypeptide "comprising a naturally-occurring amino acid sequence at least 90% identical to the amino
acid sequence of SEQ ID NO:1," and the "variant" language of independent claim 12 recites "a
polynucleotide comprising a naturally-occurring polynucleotide sequence at least 90% identical to the
sequence of SEQ ID NO:2." Furthermore, the "fragment" language of independent claim 3 recites a
polynucleotide encoding a "fragment of a polypeptide having the amino acid sequence of SEQ ID
NO:1, wherein said fragment transports phosphate," and the "fragment" language of independent claim
13 recites a polynucleotide comprising at least 20 contiguous nucleotides of "a polynucleotide
consisting of nucleotides 1183 through 1454 of the polynucleotide sequence of SEQ ID NO:2." The
amino acid sequence of SEQ ID NO:1 and the polynucleotide sequence of SEQ ID NO:2 are explicitly
disclosed in the specification. See, for example, the Sequence Listing. Variants of SEQ ID NO:1 and
SEQ ID NO:2 are described in the Specification at, for example, page 3, lines 5-7; page 4, lines 29-
32; page 5, lines 5-8 and 15-23; page 9, lines 28-30; page 11, lines 5-8 and 14-21; page 12, lines 3-4
and 11-30; and page 14, line 22 to page 15, line 12; and fragments of SEQ ID NO:1 and SEQ ID
NO:2 are described at, for example, page 3, lines 8-11; page 4, lines 23-28; page 8, lines 21-25; page
11, lines 32-33; page 14, lines 19-21; page 20, lines 10-13; page 23, lines 23-29; page 40, lines 8-10;
page 41, lines 2-5; and page 42, lines 7-10. The portion of SEQ ID NO:2 consisting of nucleotides
1183 through 1454 corresponds to the polynucleotide disclosed as Incyte Clone 754412. est at, for
example, page 38, lines 11-12 and 24-25, and shown as SEQ ID NO:5 in the Sequence Listing. In
addition, a specific assay to measure phosphate transport is disclosed in the Specification at, for
example, page 41, lines 21-30.

One of ordinary skill in the art would recognize polynucleotide sequences which are variants
having a polynucleotide sequence at least 90% or 95% identical to SEQ ID NO:2, or which encode
polypeptide variants having an amino acid sequence at least 90% identical to SEQ ID NO:1. Given any
naturally occurring polynucleotide sequence, it would be routine for one of skill in the art to recognize
whether it was a variant of SEQ ID NO:2, and whether it encoded a variant of SEQ ID NO:1.
Accordingly, the specification provides an adequate written description of the recited polynucleotide
variants of SEQ ID NO:2 and polynucleotides encoding polypeptide variants of SEQ ID NO:1.
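
By way of illustration only, the kind of comparison referred to above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a method taken from the specification: the function names, the toy sequences, and the simplified definition of identity (identical positions divided by aligned columns, with gaps counted as mismatches) are all hypothetical, and the two sequences are assumed to have already been aligned to equal length by a separate alignment tool.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned, equal-length sequences.

    Assumption for illustration: a gap ('-') opposite a residue counts as a
    mismatch, and columns where both sequences have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("no aligned columns to compare")
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def meets_identity_threshold(aligned_ref: str, aligned_candidate: str,
                             threshold: float = 90.0) -> bool:
    """True if the candidate is at least `threshold` percent identical."""
    return percent_identity(aligned_ref, aligned_candidate) >= threshold


# Toy sequences (hypothetical; not SEQ ID NO:1 or SEQ ID NO:2).
reference = "ACGTACGTAC"
candidate = "ACGTACGTAA"   # one mismatch in ten positions
print(percent_identity(reference, candidate))
print(meets_identity_threshold(reference, candidate))
```

The sketch only tests the structural limitation (percent identity); whether a given sequence is naturally occurring is a separate question that the code does not address.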

One of ordinary skill in the art would recognize polynucleotide sequences which are fragments
comprising at least 20 contiguous nucleotides of the portion of SEQ ID NO:2 consisting of nucleotides
1183 through 1454 (i.e., SEQ ID NO:5), or which encode polypeptide sequences which are fragments
of SEQ ID NO:1. The information provided by SEQ ID NO:1, SEQ ID NO:2, and SEQ ID NO:5
provides the necessary framework for the recited fragments; to recite every possible fragment would
needlessly clutter the application. Furthermore, it would be routine for one of skill in the art to
determine whether any particular fragment of SEQ ID NO:1 had phosphate transport activity, using the
disclosed phosphate transport assay. Accordingly, the specification provides an adequate written
description of the recited polynucleotide fragments of SEQ ID NO:2, and polynucleotides encoding the
recited fragments of SEQ ID NO:1.
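
As a hedged illustration of how the recited fragments follow from the disclosed sequence, the following Python sketch checks whether a candidate sequence comprises at least 20 contiguous nucleotides of a given region. The helper names and the toy sequences are hypothetical and are not drawn from the specification; the only figures taken from the text above are the 1-based, inclusive coordinates 1183 through 1454 and the minimum length of 20.

```python
def region(sequence: str, start: int, end: int) -> str:
    """Return the 1-based, inclusive region start..end of a sequence."""
    if start < 1 or end < start:
        raise ValueError("coordinates must satisfy 1 <= start <= end")
    return sequence[start - 1:end]


def comprises_fragment(candidate: str, target: str, min_len: int = 20) -> bool:
    """True if candidate contains at least min_len contiguous residues of target.

    Any shared run of min_len or more residues necessarily contains a shared
    window of exactly min_len residues, so comparing fixed-size windows suffices.
    """
    if min_len <= 0:
        raise ValueError("min_len must be positive")
    target_windows = {target[i:i + min_len]
                      for i in range(len(target) - min_len + 1)}
    return any(candidate[i:i + min_len] in target_windows
               for i in range(len(candidate) - min_len + 1))


# A region spanning positions 1183 through 1454 is 272 nucleotides long.
placeholder = "A" * 1500                      # stand-in, not SEQ ID NO:2
print(len(region(placeholder, 1183, 1454)))   # 272

# Toy target and candidates (hypothetical sequences).
toy_target = "ACGT" * 10
print(comprises_fragment("TTTT" + toy_target[5:30] + "GGGG", toy_target))
print(comprises_fragment("A" * 50, toy_target))
```

Because every qualifying fragment can be generated or recognized mechanically from the disclosed region in this way, listing each fragment individually would add nothing to the disclosure.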

1. The present claims specifically define the claimed genus through the recitation of 
chemical structure 

Court cases in which "DNA claims" have been at issue (which are hence relevant to claims to 
proteins encoded by the DNA) commonly emphasize that the recitation of structural features or 
chemical or physical properties are important factors to consider in a written description analysis of such 
claims. For example, in Fiers v. Revel, 25 USPQ2d 1601, 1606 (Fed. Cir. 1993), the court stated 
that: 

If a conception of a DNA requires a precise definition, such as by structure, formula, 
chemical name or physical properties, as we have held, then a description also requires 
that degree of specificity. 

In a number of instances in which claims to DNA have been found invalid, the courts have
noted that the claims attempted to define the claimed DNA in terms of functional characteristics without
any reference to structural features. As set forth by the court in University of California v. Eli Lilly
and Co., 43 USPQ2d 1398, 1406 (Fed. Cir. 1997):

In claims to genetic material, however, a generic statement such as "vertebrate insulin 
cDNA" or "mammalian insulin cDNA," without more, is not an adequate written 
description of the genus because it does not distinguish the claimed genus from others, 
except by function. 

Thus, the mere recitation of functional characteristics of a DNA, without the definition of
structural features, has been a common basis by which courts have found invalid claims to DNA. For
example, in Lilly, 43 USPQ2d at 1407, the court found invalid for violation of the written description
requirement the following claim of U.S. Patent No. 4,652,525:

1. A recombinant plasmid replicable in procaryotic host containing within its nucleotide
sequence a subsequence having the structure of the reverse transcript of an mRNA of a 
vertebrate, which mRNA encodes insulin. 

In Fiers, 25 USPQ2d at 1603, the parties were in an interference involving the following count: 

A DNA which consists essentially of a DNA which codes for a human fibroblast 
interferon-beta polypeptide. 

Party Revel in the Fiers case argued that its foreign priority application contained an adequate 
written description of the DNA of the count because that application mentioned a potential method for 
isolating the DNA. The Revel priority application, however, did not have a description of any particular 
DNA structure corresponding to the DNA of the count. The court therefore found that the Revel 
priority application lacked an adequate written description of the subject matter of the count. 

Thus, in Lilly and Fiers, nucleic acids were defined on the basis of functional characteristics and
were found not to comply with the written description requirement of 35 U.S.C. § 112; i.e., "an mRNA
of a vertebrate, which mRNA encodes insulin" in Lilly, and "DNA which codes for a human fibroblast
interferon-beta polypeptide" in Fiers. In contrast to the situation in Lilly and Fiers, the claims at issue
in the present application define polynucleotides and polypeptides in terms of chemical structure, rather
than functional characteristics. For example, the language of independent claims 3 and 12 recites
chemical structure to define the claimed genus:

3. An isolated polynucleotide encoding a polypeptide selected from the group
consisting of:
a) a polypeptide comprising the amino acid sequence of SEQ ID NO:1,
b) a polypeptide comprising a naturally-occurring amino acid sequence at least 90%
identical to the amino acid sequence of SEQ ID NO:1, and
c) a fragment of a polypeptide having the amino acid sequence of SEQ ID NO:1,
wherein said fragment transports phosphate.

12. An isolated polynucleotide selected from the group consisting of:
a) a polynucleotide comprising the polynucleotide sequence of SEQ ID NO:2,
b) a polynucleotide comprising a naturally-occurring polynucleotide sequence at
least 90% identical to the polynucleotide sequence of SEQ ID NO:2,
c) a polynucleotide completely complementary to a polynucleotide of a),
d) a polynucleotide completely complementary to a polynucleotide of b), and
e) an RNA equivalent of a)-d).

From the above it should be apparent that the claims of the subject application are
fundamentally different from those found invalid in Lilly and Fiers. The subject matter of the present
claims is defined in terms of the chemical structure of SEQ ID NO:1 and SEQ ID NO:2. In the present
case, there is no reliance merely on a description of functional characteristics of the polynucleotides and 
polypeptides. The polynucleotides defined by the claims of the present application recite structural 
features, and cases such as Lilly and Fiers stress that the recitation of structure is an important factor to 
consider in a written description analysis of claims of this type. By failing to base the written description 
inquiry "on whatever is now claimed," the Patent Office failed to provide an appropriate analysis of the 
present claims and how they differ from those found not to satisfy the written description requirement in 
Lilly and Fiers. 

2. The present claims do not define a genus which is "highly variant" 

Furthermore, the claims at issue do not describe a genus which could be characterized as 
"highly variant." Available evidence illustrates that, rather than being a large variable genus, the claimed 
genus is of narrow scope. 

In support of this assertion, the Examiner's attention is directed to the enclosed reference by 
Brenner et al. ("Assessing sequence comparison methods with reliable structurally identified distant 
evolutionary relationships," Proc. Natl. Acad. Sci. USA (1998) 95:6073-6078). Through exhaustive
analysis of a data set of proteins with known structural and functional relationships and with <90% 
overall sequence identity, Brenner et al. have determined that 30% identity is a reliable threshold for 
establishing evolutionary homology between two sequences aligned over at least 150 residues (Brenner 
et al., pages 6073 and 6076). Furthermore, local identity is particularly important in this case for 
assessing the significance of the alignments, as Brenner et al. further report that ≥40% identity over at
least 70 residues is reliable in signifying homology between proteins (Brenner et al., page 6076). 

The present application is directed, inter alia, to polynucleotides encoding phosphate 
transporter proteins, including polynucleotides encoding phosphate transporter proteins related to the 
amino acid sequence of SEQ ID NO:1. In accordance with Brenner et al., naturally occurring
molecules may exist which could be characterized as phosphate transporter proteins and which have as
little as 30% identity over at least 150 residues to SEQ ID NO:1. The "variant" language of the present
claims recites a polynucleotide encoding a polypeptide comprising "a naturally-occurring amino acid
sequence at least 90% identical to the amino acid sequence of SEQ ID NO:1" (note that SEQ ID
NO:1 has 401 amino acid residues). This variation is far less than that of polynucleotides encoding all
potential phosphate transporter proteins related to SEQ ID NO:1, i.e., those phosphate transporter
proteins having as little as 30% identity over at least 150 residues to SEQ ID NO:1.
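
The difference in scope can be made concrete with simple arithmetic. The sketch below is illustrative only and rests on a simplifying assumption that is not part of the claims or of Brenner et al.: identity is treated as the number of identical positions over a fixed, ungapped alignment length. The function names are hypothetical; the figures 401, 90%, 30%, and 150 are those recited above.

```python
def min_identical_positions(length: int, percent: int) -> int:
    """Smallest whole number of identical positions meeting `percent` identity
    over `length` aligned positions (integer ceiling of length * percent / 100)."""
    return (length * percent + 99) // 100


def max_differing_positions(length: int, percent: int) -> int:
    """Largest number of positions that may differ while still meeting the threshold."""
    return length - min_identical_positions(length, percent)


# Claimed variants: at least 90% identity over the 401 residues of SEQ ID NO:1.
print(min_identical_positions(401, 90))   # 361
print(max_differing_positions(401, 90))   # 40

# Homology threshold discussed above: 30% identity over 150 residues.
print(min_identical_positions(150, 30))   # 45
print(max_differing_positions(150, 30))   # 105
```

Under this simplified model, a claimed variant may differ from SEQ ID NO:1 at no more than 40 of 401 positions, whereas a sequence meeting only the 30% threshold could differ at as many as 105 positions within a 150-residue window alone.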

3. The state of the art at the time of the present invention is further advanced than at 
the time of the Lilly and Fiers applications 

In the Lilly case, claims of U.S. Patent No. 4,652,525 were found invalid for failing to comply 
with the written description requirement of 35 U.S.C. § 112. The '525 patent claimed the benefit of
priority of two applications, Application Serial No. 801,343 filed May 27, 1977, and Application Serial 
No. 805,023 filed June 9, 1977. In the Fiers case, party Revel claimed the benefit of priority of an 
Israeli application filed on November 21, 1979. Thus, the written description inquiry in those cases was 
based on the state of the art at essentially the "dark ages" of recombinant DNA technology. 

The present application has a priority date of February 24, 1997. Much has happened in the 
development of recombinant DNA technology in the 20 or so years from the time of filing of the 
applications involved in Lilly and Fiers and the present application. For example, the technique of 
polymerase chain reaction (PCR) was invented. Highly efficient cloning and DNA sequencing 
technology has been developed. Large databases of protein and nucleotide sequences have been 
compiled. Much of the raw material of the human and other genomes has been sequenced. With these 
remarkable advances, one of skill in the art would recognize that, given the sequence information of 
SEQ ID NO:1 and SEQ ID NO:2, and the additional extensive detail provided by the subject
application, the present inventors were in possession of the claimed polynucleotide variants and 
fragments at the time of filing of this application. 

4. Summary 

The Office Action failed to base the written description inquiry "on whatever is now claimed." 
Consequently, the Office Action did not provide an appropriate analysis of the present claims and how 
they differ from those found not to satisfy the written description requirement in cases such as Lilly and 
Fiers. In particular, the claims of the subject application are fundamentally different from those found
invalid in Lilly and Fiers. The subject matter of the present claims is defined in terms of the chemical
structure of SEQ ID NO:1 and SEQ ID NO:2. The courts have stressed that structural features are
important factors to consider in a written description analysis of claims to nucleic acids and proteins. In 
addition, the genus of polynucleotides defined by the present claims is adequately described, as 
evidenced by Brenner et al. Furthermore, there have been remarkable advances in the state of the art 
since the Lilly and Fiers cases, and these advances were given no consideration whatsoever in the 
position set forth by the Office Action. 

For at least the reasons set forth above, the specification provides an adequate written 
description of the claimed subject matter, and this rejection should be withdrawn. 

VIII. Enablement rejection under 35 U.S.C. § 112, first paragraph 

Claims 3, 6, 7, 9, 12, 13, 46, 48, 57, and 58 were rejected under 35 U.S.C. § 112, first 
paragraph, based on the allegation that the specification does not describe the subject matter of the 
invention in such a way as to enable one of skill in the art to make and/or use the claimed variants and 
fragments (Office Action, September 20, 2002; page 8). In particular, the Office Action asserts that 
the Specification does not describe how to make and/or use polynucleotides encoding naturally 
occurring polypeptides at least 90% identical to SEQ ID NO:1, polynucleotides comprising naturally
occurring sequences at least 90% or 95% identical to SEQ ID NO:2, polynucleotides comprising at
least 20 or 60 contiguous nucleotides of SEQ ID NO:2 or variants thereof, and arrays comprising a
nucleic acid molecule that specifically hybridizes with at least 30 contiguous nucleotides of the recited 
polynucleotides. Such, however, is not the case. 

With respect to the claimed variants, the Office Action asserts that "the specification does not 
establish: (A) regions of the nucleic acid structure which may be modified without affecting the encoded 
polypeptide activity; (B) the general tolerance of SEQ ID NO:2 to modification and extent of such 
tolerance; (C) a rational and predictable scheme for modifying any residues of SEQ ID NO:2 with an 
expectation of obtaining the desired biological function; and (D) the specification provides insufficient 
guidance as to which of the essentially infinite possible choices is likely to be successful" (Office Action, 
September 20, 2002; pages 10-11). Note that claim 3, for example, recites not only that the
polynucleotides encode polypeptides which are at least 90% identical to SEQ ID NO:1, but also that
they have "a naturally occurring amino acid sequence." Through the process of natural selection,
nature will have determined the appropriate amino acid sequences. Given the information provided by
SEQ ID NO:1 (the amino acid sequence of NAPTR) and SEQ ID NO:2 (the polynucleotide sequence
encoding NAPTR), one of skill in the art would be able to routinely obtain "a naturally occurring amino
acid sequence at least 90% identical to the amino acid sequence of SEQ ID NO:1." For example, the
identification of relevant polynucleotides could be performed by hybridization and/or PCR techniques 
that were well-known to those skilled in the art at the time the subject application was filed and/or 
described throughout the Specification of the instant application. See, e.g., page 13, line 7 to page 14, 
line 8; page 33, lines 9-31; and Example VI at page 40. Thus, one skilled in the art need not make and 
test vast numbers of polynucleotides that encode polypeptides based on the amino acid sequence of 
SEQ ID NO:1, or vast numbers of polynucleotides based on the polynucleotide sequence of SEQ ID
NO:2. Instead, one skilled in the art need only screen a cDNA library or use appropriate PCR
conditions to identify relevant polynucleotides, and their encoded polypeptides, that already exist in
nature. By extension, one of skill in the art could make fragments of naturally occurring polynucleotides at
least 90% identical to SEQ ID NO:2, and could use such fragments, for example, as hybridization 
probes to detect full-length naturally occurring polynucleotides at least 90% identical to SEQ ID NO:2. 
In addition, one of skill in the art would be able to routinely obtain probes specifically hybridizable to at 
least 30 contiguous nucleotides of polynucleotides at least 90% identical to SEQ ID NO:2 
(Specification, e.g., at page 31, lines 19-26), and could make arrays comprising such probes and use 
them, for example, to detect full-length naturally occurring polynucleotides at least 90% identical to 
SEQ ID NO:2. 

The Office Action asserts that "modifications to an encoding nucleic acid, even minor 
modifications, may completely alter the function of the encoded protein sequence" (Office Action, 
September 20, 2002; page 10; emphasis added), and cites Broun et al. (Science (1998) 282:1315-
1317), Seffernick et al. (J. Bacteriol. (2001) 183:2405-2410), and Bork (Genome Res. (2000)) as
support. These assertions demonstrate that the Office Action has based the alleged lack of enablement 
on the mere possibility that mutations can sometimes eliminate the biological function of a naturally 
occurring polypeptide. This conclusion ignores the teachings of Brenner et al. (Proc. Natl. Acad. Sci.
USA (1998) 95:6073-6078; of record), which speak to the general applicability of using sequence
homology as low as 30% over 150 amino acid residues, and as low as 40% over 70 amino acid
residues, to indicate protein homology, and Bork (cited by the Office Action), which teaches that the
prediction of functional features by homology has a 90% accuracy rate, and that the overall accuracy
rate for bioinformatics predictions is 70% (Table 1 of Bork).

For example, the Office Action states that Broun et al. teach that "as few as four amino acid 
substitutions in a polypeptide having approximately 380 amino acids completely alters the enzymatic 
function of the polypeptide from a desaturase to a hydroxylase" (Office Action, September 20, 2002; 
page 10). Broun et al. disclose that "only four changes are required to convert a strict desaturase to an 
enzyme that retains some desaturase activity but is also an efficient hydroxylase" (Broun et al., page 
1317, left column, 1st paragraph; emphasis added). Thus, the mutations do not completely alter the 
enzymatic function of the polypeptide, as asserted by the Office Action. The mutant polypeptide can 
still be used as a desaturase. Broun et al. also note that "a small number of amino acid substitutions
will account for the functional divergence of desaturases, hydroxylases, expoxgenases [sic], and 
acetylenic bond-forming enzymes" (Broun et al., page 1317, left column, 3rd paragraph). This supports 
the notion that most amino acid substitutions have no effect or minimal effect on protein function. 

The Office Action cites Seffernick et al. as evidence that "two polypeptides encoded by 
naturally-occurring polynucleotides, while sharing significant sequence homology, may have completely 
different functions" (Office Action, September 20, 2002; page 10). Seffernick et al. describe a 
melamine deaminase and an atrazine chlorohydrolase that share 98% sequence identity and yet have 
different substrate specificities. These two enzymes both belong to the amidohydrolase enzyme 
superfamily whose members catalyze the hydrolytic displacement of amino groups or chlorine 
substituents from triazine ring compounds (e.g., Seffernick et al., page 2409, right column, second
paragraph). Notably, there is at least one member of the amidohydrolase superfamily that catalyzes
both deamination and dechlorination reactions with triazine ring substrates (Id.). Therefore, the 98%
sequence homology between melamine deaminase and atrazine chlorohydrolase correctly predicts their 
functional similarity and their membership in a common enzyme family. 

This example in which it is difficult to obtain a precise functional prediction does not contradict
the findings of Bork that, in the majority of cases, protein function is accurately predicted by sequence 
homology methods. In the Seffernick example, sequence homology methods correctly assign proteins 
to a particular enzyme family whose members share similar enzyme activities. Thus, Seffernick et al. do 
not contradict the evidence that one of skill in the art would reasonably conclude that NAPTR could be 
used in the same manner as the NPT1 phosphate transporter. 

Seffernick et al. recognize that "functional assignments based on >50% sequence identity are 
considered to be reasonably sound" (Seffernick et al., page 2409, left column, paragraph 2). These 
authors state that their finding that "proteins with >98% sequence identity catalyze different reactions in 
different metabolic pathways is highly exceptional" (Id.; emphasis added). This supports the conclusion that, 
while there may be a number of examples in which the assignment of function by sequence homology is 
not perfectly accurate, such examples do not contradict the findings of Bork that, in general, sequence 
homology is an accurate method for assigning biological function. 
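
By way of illustration only, the identity thresholds discussed above (>50%, 90%, 98%) reduce to a simple calculation. The following Python sketch assumes two sequences that have already been aligned to equal length, with "-" marking gaps, and counts identity over gap-free columns only. The function names and the gap-handling convention are assumptions made for this example; they are not taken from Bork, from Seffernick et al., or from the Specification, and other identity conventions exist.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns in which neither sequence has a gap.

    Assumption: the inputs are already aligned (equal length, gaps as `gap`).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip gapped columns (one of several possible conventions)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


def meets_threshold(aligned_a: str, aligned_b: str, threshold: float = 90.0) -> bool:
    """True if the two aligned sequences are at least `threshold` percent identical."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

Under this convention, two aligned ten-residue sequences differing at a single position are 90% identical and would satisfy an "at least 90% identical" limitation.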

In further support of the rejection, the Office Action cites Bork as evidence that "predicting the 
function of a polypeptide encoded by a specific gene, by sequence database searches has a 
considerable error rate" (Office Action, September 20, 2002; page 10). However, this does not 
negate the fact that there is a 90% accuracy rate for the prediction of functional features by homology, 
as disclosed by Bork. At most the Office Action shows that errors can occur in functional assignment. 
The Bork reference does not show that errors do not occur, but it does quantify the error rate at about 
10%. The references cited in the Office Action show that there may be difficulties and errors involved 
in predicting protein function by homology. However, these references do not contradict the fact that 
such methods are accurate more often than not. As such, one of skill in the art would reasonably 
conclude that NAPTR possesses the functions of the family of phosphate transporter proteins. 

The Office Action has failed to demonstrate that one of skill in the art could not make and use 
the claimed polynucleotides encoding polypeptide variants comprising naturally occurring amino acid 
sequences at least 90% identical to SEQ ID NO:1. The Office Action has only provided isolated 
examples in which mutations can sometimes result in a shift of the biological activity of a naturally 
occurring polypeptide to a related biological activity found in other members of the polypeptide family. 
The cited references have no bearing on the ability of a skilled artisan to screen a cDNA library or use 
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appropriate PCR conditions to identify relevant polynucleotides, and their encoded polypeptides, that 
already exist in nature, without undue experimentation. 

With respect to the claimed arrays, the Office Action has not provided any arguments 
concerning the alleged lack of enablement of arrays comprising nucleic acid molecules "specifically 
hybridizable with at least 30 contiguous nucleotides" of a polynucleotide comprising SEQ ID NO:2. 
One of skill in the art could make and use the claimed arrays without undue experimentation, based on 
the Specification and the state of the art at the time the application was filed. For example, one of skill 
in the art would know how to make the claimed arrays by producing specific hybridization probes for a 
polynucleotide comprising SEQ ID NO:2 (Specification, e.g., at page 31, lines 19-26). In addition, one 
of skill in the art would know how to use the claimed arrays to detect the presence of a polynucleotide 
comprising SEQ ID NO:2 (Specification, e.g., at page 31, lines 6-26; and Example VI at page 40). 

With respect to the claimed fragments, the Office Action has not provided any arguments 
concerning the alleged lack of enablement of polynucleotides "comprising at least 20 [or 60] contiguous 
nucleotides" of SEQ ID NO:2. One of skill in the art could make and use the claimed polynucleotide 
fragments without undue experimentation, based on the Specification and the state of the art at the time 
the application was filed. For example, one of skill in the art would know how to use the claimed 
polynucleotide fragments as hybridization probes or PCR probes to detect the presence of a 
polynucleotide comprising SEQ ID NO:2 (Specification, e.g., at page 33, lines 9-31; and Example VI 
at page 40). 
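
As an illustration only, the claim language "comprising at least 20 [or 60] contiguous nucleotides" of a defined region can be checked computationally. The Python sketch below uses hypothetical helper names and toy sequences (not SEQ ID NO:2). It extracts a 1-based, inclusive region from a reference sequence and tests whether a candidate shares a contiguous run of at least a given length with that region. It considers one strand only and does not address complements or RNA equivalents.

```python
def region(seq: str, start: int, end: int) -> str:
    """Return the 1-based, inclusive slice seq[start..end]."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("region out of range")
    return seq[start - 1:end]


def longest_shared_run(candidate: str, reference: str) -> int:
    """Length of the longest contiguous substring common to both sequences.

    Standard dynamic-programming longest-common-substring, keeping one row.
    """
    candidate, reference = candidate.upper(), reference.upper()
    best = 0
    prev = [0] * (len(reference) + 1)
    for c in candidate:
        curr = [0] * (len(reference) + 1)
        for j, r in enumerate(reference, start=1):
            if c == r:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best:
                    best = curr[j]
        prev = curr
    return best


def comprises_fragment(candidate: str, reference: str,
                       start: int, end: int, min_len: int = 20) -> bool:
    """True if `candidate` contains >= min_len contiguous nucleotides of the region."""
    return longest_shared_run(candidate, region(reference, start, end)) >= min_len
```

For the region recited in the amended claims, a call would take the form `comprises_fragment(candidate, seq_id_no_2, 1183, 1454, 20)`, where `seq_id_no_2` is a placeholder for the sequence taken from the Sequence Listing.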

As set forth in In re Marzocchi, 169 USPQ 367, 369 (CCPA 1971): 

The first paragraph of § 112 requires nothing more than objective enablement. How 
such a teaching is set forth, either by the use of illustrative examples or by broad 
terminology, is of no importance. 

As a matter of Patent Office practice, then, a specification disclosure which contains a 
teaching of the manner and process of making and using the invention in terms which 
correspond in scope to those used in describing and defining the subject matter sought 
to be patented must be taken as in compliance with the enabling requirement of the first 
paragraph of § 112 unless there is reason to doubt the objective truth of the statements 
contained therein which must be relied on for enabling support. 

Contrary to the standard set forth in Marzocchi, the Office Action has failed to provide any 
reasons why one would doubt that the guidance provided by the present Specification would enable 
one to make and use the recited polynucleotides encoding polypeptide variants of SEQ ID NO:1, the 
recited polynucleotide variants and fragments of SEQ ID NO:2, or the recited arrays comprising nucleic 
acid molecules specifically hybridizable to portions of the recited polynucleotides. Hence, a prima 
facie case for non-enablement has not been established with respect to the recited variants and 
fragments of SEQ ID NO:1 and SEQ ID NO:2. 

For at least the above reasons, withdrawal of this rejection is requested. 

IX. Rejection under 35 U.S.C. § 102(a) 

Claims 3, 13, and 58 were rejected under 35 U.S.C. § 102(a) because the recited 
polynucleotides are allegedly anticipated by Gasparini (GenBank Accession Number Z83593). The 
Office Action asserts that "Gasparini teaches a polynucleotide encoding amino acids 96 to 193 of SEQ 
ID NO:1 and would anticipate claim 3 part d) as being an immunogenic fragment ... of SEQ ID NO:1. 
The polynucleotide of Gasparini is 100% identical to nucleotides 519 to 814 of SEQ ID NO:2 and 
would anticipate part a) of claims 13 and 58" (Office Action, September 20, 2002; pages 11-12). This 
rejection is traversed. 

To expedite prosecution, claim 3 has been amended such that it does not recite immunogenic 
fragments of SEQ ID NO:1. By this amendment, Applicants expressly do not disclaim equivalents of 
the invention which could include polynucleotides encoding immunogenic fragments of SEQ ID NO:1. 
Applicants do not concede to the Patent Office position; Applicants are amending the claim solely to 
obtain expeditious allowance of the instant application. 

To expedite prosecution, claims 13 and 58 have been amended such that they recite fragments 
of the portion of SEQ ID NO:2 consisting of nucleotides 1183 through 1454. Support for this 
amendment can be found in the specification at, for example, page 38, lines 11-12 and 24-25; and in 
the Sequence Listing. For example, the portion of SEQ ID NO:2 consisting of nucleotides 1183 
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through 1454 corresponds to the polynucleotide sequence disclosed in the specification as Incyte Clone 
754412.est, and shown as SEQ ID NO:5 in the Sequence Listing. By these amendments, Applicants 
expressly do not disclaim equivalents of the invention which could include polynucleotides comprising 
fragments consisting of at least 20 contiguous nucleotides of SEQ ID NO:2, or polynucleotides 
comprising fragments consisting of at least 20 contiguous nucleotides of a polynucleotide at least 90% 
identical to SEQ ID NO:2. Applicants do not concede to the Patent Office position; Applicants are 
amending the claims solely to obtain expeditious allowance of the instant application. 

While not conceding to the Patent Office position, it is believed that claims 3, 13, and 58, as 
amended, recite patentable subject matter. Therefore, withdrawal of this rejection is requested. 

X. Obviousness-type double patenting over U.S. Patent 5,985,604 

Claims 3-7, 9, 10, 12, and 57 were rejected under the judicially created doctrine of 
obviousness-type double patenting over claims 1-8 of U.S. Patent No. 5,985,604 (hereinafter "the 
'604 patent"). Applicants request that the requirement for submission of a Terminal Disclaimer with 
respect to the '604 patent be held in abeyance until such time that there is an indication of allowable 
subject matter. 
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CONCLUSION 



In light of the above amendments and remarks, Applicants submit that the present application is 
fully in condition for allowance, and request that the Examiner withdraw the outstanding rejections. 
Early notice to that effect is earnestly solicited. 

If the Examiner contemplates other action, or if a telephone conference would expedite 
allowance of the claims, Applicants invite the Examiner to contact the undersigned at (650) 621-8581. 

Applicants believe that no fee is due with this communication. However, if the USPTO 
determines that a fee is due, the Commissioner is hereby authorized to charge Deposit Account No. 
09-0108. 



Respectfully submitted, 



INCYTE GENOMICS, INC. 



Date: 




3160 Porter Drive 
Palo Alto, California 94304 
Phone: (650) 855-0555 
Fax: (650) 849-8886 
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VERSION WITH MARKINGS TO SHOW CHANGES MADE 



IN THE SPECIFICATION 

The title of the application has been amended as follows: 

POLYNUCLEOTIDES ENCODING A [NOVEL] HUMAN SODIUM-DEPENDENT 
PHOSPHATE COTRANSPORTER 

IN THE CLAIMS: 

Claim 1 has been canceled, without prejudice or disclaimer. 

Claims 3, 10, 12, 13, 48, 57, and 58 have been amended as follows: 

3. (Twice Amended) An isolated polynucleotide encoding a polypeptide selected from the 
group consisting of: 

a) a polypeptide comprising the amino acid sequence of SEQ ID NO:1, 

b) a polypeptide comprising a [naturally occurring] naturally-occurring amino acid sequence at 
least 90% identical to the amino acid sequence of SEQ ID NO:1, and 

c) a fragment of a polypeptide having the amino acid sequence of SEQ ID NO:1, wherein said 
fragment transports phosphate[, and 

d) an immunogenic fragment of a polypeptide having the amino acid sequence of SEQ ID NO:1]. 

10. (Once Amended) [A] The method of claim 9, wherein the polypeptide has the amino acid 
sequence of SEQ ID NO:1. 
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12. (Once Amended) An isolated polynucleotide selected from the group consisting of: 

a) a polynucleotide comprising the polynucleotide sequence of SEQ ID NO:2, 

b) a polynucleotide comprising a [naturally occurring] naturally-occurring polynucleotide 
sequence at least 90% identical to the polynucleotide sequence of SEQ ID NO:2, 

c) a polynucleotide completely complementary to a polynucleotide of a), 

d) a polynucleotide completely complementary to a polynucleotide of b), and 

e) an RNA equivalent of a)-d). 

13. (Twice Amended) An isolated polynucleotide comprising at least 20 contiguous 
nucleotides of a polynucleotide selected from the group consisting of: 

a) a polynucleotide [comprising] consisting of nucleotides 1183 through 1454 of the 
polynucleotide sequence of SEQ ID NO:2, 

b) a polynucleotide [comprising a naturally occurring] consisting of a naturally-occurring 
polynucleotide sequence at least 90% identical to nucleotides 1183 through 1454 of the polynucleotide 
sequence of SEQ ID NO:2, 

c) a polynucleotide completely complementary to a polynucleotide of a), 

d) a polynucleotide completely complementary to a polynucleotide of b), and 

e) an RNA equivalent of a)-d). 

48. (Once Amended) An array comprising different [nucleotide] nucleic acid molecules affixed 
in distinct physical locations on a solid substrate, wherein at least one of said [nucleotide] nucleic acid 
molecules comprises a first oligonucleotide or polynucleotide sequence specifically hybridizable with at 
least 30 contiguous nucleotides of a target polynucleotide, and wherein said target polynucleotide is a 
polynucleotide of claim 12. 

57. (Twice Amended) A polynucleotide of claim 12, selected from the group consisting of: 

a) a polynucleotide comprising the polynucleotide sequence of SEQ ID NO:2, 

b) a polynucleotide comprising a [naturally occurring] naturally-occurring polynucleotide 
sequence at least 95% identical to the polynucleotide sequence of SEQ ID NO:2, 



103594 



50 



09/991,212 



Docket No.: PF-0221-3DIV 

c) a polynucleotide completely complementary to a polynucleotide of a), 

d) a polynucleotide completely complementary to a polynucleotide of b), and 

e) an RNA equivalent of a)-d). 

58. (Once Amended) An isolated polynucleotide of claim 13, comprising at least 60 
contiguous nucleotides of a polynucleotide selected from the group consisting of: 

a) a polynucleotide [comprising] consisting of nucleotides 1183 through 1454 of the 
polynucleotide sequence of SEQ ID NO:2, 

b) a polynucleotide [comprising a naturally occurring] consisting of a naturally-occurring 
polynucleotide sequence at least 90% identical to nucleotides 1183 through 1454 of the polynucleotide 
sequence of SEQ ID NO:2, 

c) a polynucleotide completely complementary to a polynucleotide of a), 

d) a polynucleotide completely complementary to a polynucleotide of b), and 

e) an RNA equivalent of a)-d). 

New claim 59 has been added as follows: 

59. (New) An isolated polynucleotide of claim 13, comprising at least 20 contiguous 
nucleotides of a polynucleotide selected from the group consisting of: 

a) a polynucleotide consisting of the polynucleotide sequence of SEQ ID NO:5, 

b) a polynucleotide completely complementary to a polynucleotide of a), and 

c) an RNA equivalent of a)-b). 

control cDNA. In order to further enrich those species differentially expressed in 
the tester cDNA, the subtracted tester population is amplified by PCR following 
every second subtraction cycle. After six cycles of subtraction (three reamplification 
steps) the reaction mix is ligated into a vector for further analysis. 

In a slightly different approach, Hara et al. (1991) utilized a method whereby 
oligo(dT30) primers attached to a latex substrate are used to first capture mRNA 
[...] population. Following 1st strand cDNA synthesis, the 
[...] of the heteroduplexes is removed by heat denaturation and centrifugation 
(the cDNA-oligotex-dT30 forms a pellet and the supernatant is removed). 
[...] mRNA is then repeatedly hybridized to the immobilized control 
(driver) cDNA (which is present in 20-fold excess). After several rounds of 
[...] mRNA molecules left in the tester mRNA population are 
[...] oligotex-dT30 population. These 
tester-specific mRNA species are then converted to cDNA and, following the 
addition of adaptor sequences, amplified by PCR. The PCR products are then 
[...] analysis using restriction [...] incorporated into the 
PCR primers. A schematic illustration of this subtraction process is shown in figure 
[...]. 

However, all these methods utilising physical separation have been described as 
[...] requirement for large starting amounts of mRNA, significant 
loss of material during the separation process and a need for several rounds of 
hybridization. Hence, new methods of differential expression analysis have recently 
been designed to eliminate these problems. 

Chemical Cross-Linking Subtraction (CCLS) 

In this technique, described by Hampson et al. (1992), driver mRNA 
is mixed with tester cDNA (1st strand only) in a ratio of > 20:1. The common 
[...] 
[...] at least 300-fold with one round of subtraction (Hampson et al. 1992), and that the 
[...] less than [...] copies per cell. This equates to genes at the low end of intermediate 
abundance (see table 1). The main advantages of the CCLS approach are that it is 
rapid, technically simple and also produces fewer false positives than [...] 
differential expression analysis methods. However, like the physical separation 
protocols, a major drawback with CCLS is the large amount of starting [...] 
required (at least 10 µg RNA). Consequently, the technique has recently been 
refined so that a renewable source of RNA can be generated. The degenerate random 
oligonucleotide primed (DROP) adaptation (Hampson et al. [...], Hampson and 
Hampson 1997) uses random hexanucleotide sequences to prime solid phase- 
[...] 

Representational Difference Analysis (RDA) 

RDA of cDNA (Hubank and Schatz 1994) [...] technique 
originally applied to genomic DNA as a means of [...] 
two complex genomes (Lisitsyn et al. 1993). [...] subtraction and 
amplification involving subtractive hybridization of the tester [...] 
excess driver. Sequences in the tester that have homologues in the driver are 
rendered unamplifiable, whereas those [...] 
ability to be amplified by PCR. The procedure [...]. 
In essence, the [...] populations are first converted to cDNA 
and amplified by PCR [...] of an adaptor. The adaptors are then 
removed from both [...] 
hybridized together in a ratio [...]. Only tester:tester 
homohybrids have 5' adaptors at each end [...] 
in at both 3' ends. Hence, only these molecules [...] amplified exponentially during 
the subsequent PCR step. [...] 
with mung bean nuclease [...] PCR-enrichment of the tester:tester 
homohybrids. The adaptors [...] are then replaced [...] 
the whole [...] 
driver (Hubank and Schatz [...]) [...] ratios of 1:400, 1:80000 and 
[...]. Different 
adaptors are ligated to the tester [...] of hybridization and 
subsequent amplification. [...] 
gene products easily observable on [...] bromide 
[...] reproducible [...] 
differentially expressed genes. Hubank and Schatz (1994) 
reported that they were able to [...] differentially expressed in 
[...] 
main drawback [...] hybridization and 
digestion are required [...] lengthier [...] 
differential display [...] 
error to occur [...] has been noted, this has 
been solved to some degree by [...] 

[Figure: schematic of the representational difference analysis (RDA) procedure. Legible labels: control (driver) cDNA and test (tester) cDNA; digest with restriction enzyme; ligate to 12/24 adaptor strands; melt 12mer; fill in ends and amplify; digest; mix 100:1, melt and hybridize; digest and ligate new 12/24 adaptor; add primer and amplify; no amplification. Procedure as described by Lisitsyn et al. (1993) and Hubank and Schatz (1994).] 



Suppression PCR Subtractive Hybridization (SSH) 

[...] differentially [...] 
Equalization occurs since [...] rare and abundant messages 
[...] hybridization [...] Higgins 
1985). The two primary [...] step permits the annealing of 
single stranded complementary [...] not hybridize in the primary 
hybridization [...] PCR amplification. Although 
there are [...] molecules present [...] 
the second hybridization [...] combination (differentially 
expressed in the [...] complementary [...] different 
adaptors) can amplify exponentially. 
[...] of differential [...] 
competent cells. Transformed [...] then be [...] and their inserts 
[...] PCR. Alternatively, the final 
PCR products can be [...] bands [...] reamplified 
and cloned. [...] 
However, [...] 
cloning of smaller [...] population of clones probably [...] not 
contain a representative selection of [...]. In addition, although 
equalization theoretically occurs [...] 
by no means [...] represented in the final population 
of clones. Thus, in order to obtain [...] proportion of those gene species that 
actually demonstrate differential [...] population, the number of 
clones that will have [...] 
approach is initially more [...] demanding. However, it 
would appear to offer better [...] larger and low abundance gene 
products. In addition, [...] 
different [...] In this way [...] 
number of clones to be isolated [...] 
identified can be achieved [...] clones for further 
characterization, or a DNA array [...] gene [...]. SSH 
has been used in this laboratory to [...] 

[Figure: schematic of suppression PCR subtractive hybridization (SSH). Legible labels: tester cDNA with adaptor 1; tester cDNA with adaptor 2; mix samples, add fresh denatured driver; add primers and amplify; no amplification; amplification suppressed due to formation of panhandle structure; linear amplification; exponential amplification. Adapted from Diatchenko et al. (1996) and Gurskaya et al.] 

[Figure: flow chart of the SSH-based screening procedure. Legible steps: control animals and treated animals; extract mRNA from tissue of interest (e.g. liver); DNase treatment; convert to cDNA; hybridization, subtraction and amplification (driving tester for up-regulated genes, driving control for down-regulated genes), also giving a complex probe for screening clones; run out products on agarose gel; extract individual bands and clone in T/A vector; PCR of 5-10 clone cultures per extracted band; screen using standard and HA agarose; different clones blotted and screened with up-regulated and down-regulated genes; differentially expressed clones selected; plasmid mini-preps of selected clones; sequencing and identification.] 



[...] characterization of the toxic potential of 
new compounds, by [...] the gene-expression profiles they elicit with [...] 

[...] obtained. For example, 
[...] Wy-14,643, up-[...] in the rat (a sensitive 
species) and produces 48 up- and [...] in the guinea pig, a 
resistant species (Rockett, Swales, [...], unpublished observations). 
One of these genes, CD81, was up-regulated in the [...] and down-regulated in the 
guinea pig following Wy-14,643 treatment. CD81 (alternatively named TAPA-1) is 
a widely expressed cell surface protein which is [...] large number of cellular 
processes including adhesion, [...] differentiation (Levy et 
al. 1998). Since all of these are [...] in the phenomena [...], it is intriguing, and 
probably mechanistically relevant, that CD81 expression is differentially regulated 
in a resistant and [...] species. [...] of this approach is 
that the majority of genes [...] database sequences, but 
the latter are predominantly [...] genes of completely 
unknown function, thus [...] overall assessment of the 
critical genes of genuine biological [...] the lack of complete 
functional identification [...], gene profiling studies 
essentially provide a 'molecular fingerprint' [...] xenobiotic challenge, 
thereby serving as a [...] platform for further detailed 
investigations. 

Differential Display (DD) 

[...] more commonly referred to as 'differential 



Figure caption (partially legible): [...] treatment with Wy-14,643 or [...] was used to generate the [...] (Clontech). Lane: 1, 1 kb [...]; [...] genes downregulated following [...] treatment; 5, genes [...]. Reproduced from Rockett et [...] 

Table 2. 



Genes up-regulated in rat liver following 3-day exposure to phenobarbital. 

Band number        Highest sequence    FASTA-EMBL gene identification 
(approximate       similarity 
size in bp) 

5 (1300)           93.5%               CYP2B1 
7 (1000)           95.1%               Preproalbumin; serum albumin mRNA 
8 (950)            98.[...]%           NCI-CGAP-Pr1 H. sapiens (EST) 
10 (850)           95.7%               CYP2B1 
11 (800)           Clone 1 94.9%       CYP2B1 
                   Clone 2 75.3%       CYP2B2 
12 (750)           93.8%               TRPM-2 mRNA; sulfated glycoprotein 
15 (600)           92.9%               Preproalbumin; serum albumin mRNA 
16 (55)            Clone 1 95.2%       CYP2B1 
                   Clone 2 93.6%       Haptoglobin mRNA partial alpha 
21 (350)           99.3%               18S, 5.8S & 28S rRNA 

Table 3. 

Genes down-regulated in rat liver following 3-day exposure to phenobarbital. 

Band number 
(approximate 
size in bp) 

1 (1500) 
2 (1200) 
3 (1000) 
7 (700) 
8 (650) 
9 (600) 
10 (550) 
11 (525) 
12 (375) 
13 (23) 
14 (170) 
15 (140) 
Others: (300) 
(275) 

Highest sequence 
similarity 

Clone 1 
Clone 2 
Clone 3 
Clone 1 
Clone 2 
Clone 1 
Clone 2 

95.3% 
92.[...]% 
91.7% 
77.2% 
94.5% 
91.0% 
86.9% 

96.2% 
86.9% 
82.0% 
73.3% 
95.7% 
100.0% 
Clone 1 97.2% 
Clone 2 100.0% 
Clone 3 100.0% 
96.0% 
97.3% 
96.7% 
93.[...]% 

FASTA-EMBL gene identification 

3-oxoacyl-CoA thiolase 
Hemopexin mRNA 
Alpha-2u-globulin mRNA 
M. musculus C1 inhibitor 
Electron transfer flavoprotein 
M. musculus Topoisomerase 1 (Topo 1) 
Soares 2NbMT M. musculus (EST) 
Alpha-2u-globulin [...] mRNA 
Soares mouse NML M. musculus (EST) 
Soares p3NMF19.5 M. musculus (EST) 
Soares mouse NML M. musculus (EST) 
NCI-CGAP-Pr1 H. sapiens (EST) 
Ribosomal protein 
Soares mouse embryo NbME13.5 (EST) 
Fibrinogen B-beta-chain 
Apolipoprotein E gene 
Soares p3NMF19.5 M. musculus (EST) 
Stratagene mouse testis (EST) 
R. norvegicus RASP 1 mRNA 
Soares mouse mammary gland (EST) 



display (DD). In this method, all the mRNA species in the control and treated cell 
[...] amplified in separate reactions using reverse transcriptase-PCR 
(RT-PCR). The products are then run side-by-side on sequencing gels. Those 
bands which are present in one display only, or which are much more intense in one 
[...] may be recovered for [...] speed with which it can 
be carried out: 2 days [...] a week to make and identify 
clones. 

[...] methods of priming the [...] 
with a 2-base 'anchor' at the 3'-end [...] 1992). Alternatively, an 
arbitrary primer may be [...] 1st strand cDNA synthesis (Welsh et al. 1992). 
This variant of RNA [...] RAP (RNA Arbitrarily 
Primed) [...] approach is that PCR products may be 
derived from anywhere in [...] frames. In addition, it 
[...] many bacterial mRNAs 
(Wong and McClelland [...]). [...] reverse transcription and 
denaturation, second strand cDNA synthesis [...] with an arbitrary primer 
[...] compared to random 
primers, which contain a mixture of all four [...] position). The resulting 
PCR thus produces a [...] on the system (primer 
length and composition, [...] usually includes 50-100 
[...] combination of different 
dT-anchors and arbitrary primers are used, [...] species from a cell can 
be amplified. When the cDNA [...] populations are analysed 
side by side on a polyacrylamide gel, [...] can be identified and 
the appropriate bands recovered for cloning and further analysis. 

Although DD [...] today for identifying 
[...] perceived disadvantages: 

[...] mRNAs (Bertioli et al. [...]) 
[...] and the isolation of very [...] 
instances (Guimaraes et al. [...]). 

[...] end of the mRNA 
[...] nor always be the case 
[...] included in Genbank and 
[...] DD cannot always be 
[...] 

[...] display often cannot be 
[...] in up to 70% of cases 
[...] reduce false positives, 
[...] (Sun and Denman 1997), 
[...] (Burn et al. 1994) 
[...] and two induced 
[...] reported that the use of 
[...] positives arising from 
[...] 

[...] weaknesses of the DD 
[...] (1996) and from [...] 


Figure 5. Two approaches to differential display (DD) analysis. 1st strand synthesis can be carried out 
either with a polydT[...] primer [...] or with an arbitrary primer. [...] 
different combinations of G, C and A to anchor the [...] 
of the majority of polyadenylated [...] 
places along the length of the mRNA [...] 
or more points in the same gene. In both cases, [...] is carried out with an [...] 
primer. Since these arbitrary primers [...] hybridize to the cDNA 
at a number of different places, [...] may be obtained from [...] 
binding point of the 1st strand primer. [...] 
used to amplify second strand [...] products, with the result that numerous [...] sequences are 
amplified. 

Restriction endonuclease-facilitated analysis of gene expression 

Serial Analysis of Gene Expression (SAGE) 

(Velculescu et al. [...]) [...] differential display [...] SAGE analysis 
[...] nucleotide sequences ('tags') of only nine to [...] provide [...] 
information to identify their [...]. Secondly, concatenation (linking 
together in a series) [...] sequencing of multiple cDNAs within a 
[...] procedure [...] in each group. Incorporated 
[...] restriction enzyme [...] recognition sequence, 
[...] with the IIS enzyme, [...] are released. The two 
populations are then ligated and [...]. The amplified products are 
cleaved with the [...] (concatemers are formed in 
the process) and cloned. [...] hundreds of gene tags 
[...]. Furthermore, the number of times 
a given transcript is identified is a quantitative measurement of that gene's 
abundance in the original population, [...] facilitates identification of 
[...] technical difficulty of the [...] biased towards abundant 
mRNAs, has not been [...] setting and has 
only been used to examine [...] to date. 
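
The counting principle behind SAGE (the number of times a tag is observed serves as a measure of transcript abundance) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the concatemer is treated as a plain string of fixed-length tags with no punctuating restriction sites or ditag structure, and the function names are hypothetical rather than part of any published SAGE software.

```python
from collections import Counter
from typing import Dict, List


def split_tags(concatemer: str, tag_len: int) -> List[str]:
    """Split a sequenced concatemer into consecutive fixed-length tags.

    Assumption: tags are simply concatenated end to end, which is a
    simplification of how real SAGE concatemers are organised.
    """
    if tag_len <= 0 or len(concatemer) % tag_len != 0:
        raise ValueError("concatemer length must be a multiple of tag_len")
    return [concatemer[i:i + tag_len] for i in range(0, len(concatemer), tag_len)]


def tag_counts(tags: List[str]) -> Counter:
    """Count how many times each tag was observed."""
    return Counter(tags)


def relative_abundance(counts: Counter) -> Dict[str, float]:
    """Convert raw tag counts into fractions of all tags observed."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tags counted")
    return {tag: n / total for tag, n in counts.items()}
```

Comparing the relative abundance of the same tag between a control and a treated tag library would then indicate candidate differentially expressed transcripts.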

Gene Expression Fingerprinting (GEF) 

[...] approach for isolating differentially 
[...] Belyavsky (1995). In this 
method, RNA is converted to cDNA [...] oligo(dT) primers. The 
cDNA population [...] endonuclease and captured with 
magnetic streptavidin microbeads to [...] removal of unwanted 5' digestion 
products. [...] the complexity of the 
fragment pool and helps to ensure that each RNA species is represented by 
not more than one restriction [...] to facilitate subsequent 
amplification of the captured [...] out with one adaptor- 
specific and one biotinylated polydT primer. The reamplified population is 
recaptured and the non-biotinylated [...] alkaline dissociation. The 
non-biotinylated strand [...] adaptor-specific 
[...] immobilized 3' cDNA 
ends are next sequentially treated with [...] 
and the products from each digestion [...] 
composed of a number of ladders (equal to the number of sequential digests used). 
By comparing test versus control [...] 
expressed products [...] 
advantages of [...] procedure are that [...] and reproducible. The 
authors estimate that 80-93% of cDNA [...] involved in the final 
fingerprint. The disadvantage [...] 
[...] bands [...] or more which 
[...] estimated to be produced in an average [...] 
those described by Uitterlinden [...] et al. (1991) may help to 
overcome this problem. 

[...] described [...] was later 
[...] digestion of the [...] authors 
compared the profiles of the [...] populations without further 
manipulation [...] 






Differential gene expression

[Figure: schematic of the SAGE procedure, after Velculescu et al. (1995). Recoverable step labels: 1st strand cDNA synthesis using biotinylated poly dT primers; capture with streptavidin beads; divide in half and ligate to linkers; cleave with tagging enzyme (TE) and produce blunt ends; ligate and amplify; cleave with anchoring enzyme (AE), isolate ditags, concatenate, clone and sequence. The caption identifies the tagging enzyme as a type IIS restriction enzyme; the rest of the caption is illegible.]
[...] it takes a great deal [...] to confirm that they are indeed up- or down-regulated in the treated tissue. Normally, the latter process [...] PCR. Even so, each of the aforementioned steps [...] to the [...] goal of rapid analysis of gene expression. These problems [...] addressed by the development of so-called DNA arrays (e.g. [...] 1995, Schena et al. 1996), the introduction of which has [...] differential gene expression analysis. DNA arrays consist of gridded [...] or glass 'chips' containing hundreds or thousands of DNA spots [...] of multiple copies of part of a [...] gene. The genes are often selected based on previously proven involvement in oncogenesis, [...] and other cellular processes, [...] gene and animal species. Human and mouse arrays are already commercially available and a few companies will construct a personalized array [...], for example Clontech Laboratories and Research Genetics Inc. The technique is rapid in that hundreds or even thousands of genes can be [...], and mRNA/cDNA from the test populations can be labelled and used directly [...]. When analysed with appropriate hardware and software, arrays offer a rapid and quantitative means to assess differences in gene expression between [...] populations. Of course, there can only be identification [...] of genes which are in the array (hence the term 'closed' system). [...] approach to elucidating the [...] system may be to combine an open and a closed system: [...] to directly identify and quantitate the expression of known genes in mRNA populations, and an open system such as SSH to isolate unknown genes which are differentially expressed.

One of the main advantages of DNA arrays is the [...] number of gene fragments which can be put on a membrane; some companies have reported gridding up to 60000 spots on a single glass [...]. These high density chip-based micro-arrays will probably [...] produced off-the-shelf items in the near future. This should facilitate [...] rapid determination of differential expression in time and [...] experiments. Aside from their high cost and the technical complexities [...] producing and probing DNA arrays, the main problem which [...] the newer micro-array (gene-chip) technologies is that results are [...] reproducible between arrays. However, this problem is being addressed and should be resolved within the next few years.

id,TO«oioni,th.b««'»!><hoi> el lf.rt th« h.«„ T " ie °"' , (l>»»tiv« 
m™. of di«»»e„„ 8 ^ , J! ^'^^ "Pid»ddBci™ 






messed genes 
:lones obtained from - 
il identity (putative 
= a rapid and efficient 
•"e pronles of gene- 
Adams tt aL (1991), 
imated that there are 
^presenting over half 



""TL" mf0mUO0n Md c,on « «■ » nMBy available rovalnwree 
from the ongtnaton) hu enabled the development of a new approach touardi 
ddferenoal gene exprenion andym as deacribed bv V„ m ,t,i, er cTTml The 

fr^„:^ e l E u S e T^ "nretussue c chccc bu, none or few 

overUppmTdTt.TI! J .4 , r °; ram 7 ,M ,0 «"»" » *e assembly of such sets of 

.ntem« For e^L tK 7 • "^T " « mv «* °' irom < h < 

httpTw^ ri^ the ' nsntUte °««n'c Research (T1GR. found at 



Problems and potential of differential expression techniques

The holistic or single cell approach?

eacaiavemeirouTjoisnncnx-eceUooouianons -Uao ,n r«. - . " 

mere are almost aiw.vs „ fl ™,i - 7 m tneca!te «" neopiasuc tissue. 

hrnW/wwwncki Y , J . ttt( Cp^HF« more mforaution see web ait : 



J. C. Rockett et al.



e.g. fluorescence activated cell soron* rF*rc» , n u 

t 1 W) «d magnetic bead teZTo^ha ^ ^?of J K "* D «'- « 

- However,th«e taking a holuticanB^ ' ' " 8 " Ro * ,er « 1998). 
There « an equally .pp^ pr £ e v^ St " C ° Wider th » — • ummportant. 

««u« are complex mue. of different £ »»o con„aerat,on After all. „nce all 
regulate each other', grow* 1 " * h ' ch ™"»te.v 
some way contribute (po„«velv 7 ?" MCh c u,d '« 

-h.ch he behind rcpon*, to externa? ,Zu, ,OWard, , the mo,e ^ mechanic, 
then more informant to cam- ou oSe enTj d^r^"* gr ° Wth - " » P" h »P« 
opposed to .„ vitro model,, where uSS^^SS TT" ^ " «™» » 

[...] biological variation should also be considered when in vivo models are being used. It is clear that individuals (humans and animals) respond in [...] ways to identical stimuli. One of the [...] is the debrisoquine oxidation polymorphism, which is mediated by cytochrome CYP2D6 and determines the [...] (Lennard 1993, Meyer and Zanger 1997). The reasons for [...] are varied and complex, but allelic variations, regulatory region polymorphisms and even physical and mental health can all contribute [...] responses. Careful thought should, therefore, be given to [...] study and to the possible value of pooling [...]. The effect of this can be beneficial through the ironing out of [...] responses and unimportant minor fluctuations of (mechanistically) irrelevant [...] in individual animals, thus providing a clearer overall picture of the molecular mechanisms of the response. However, at the same time, [...] variations may be of utmost [...] to succumb to or resist the effects of a given chemical/disease.

Z^^LT PTUn0n '"^ « - *** „ 

that mam. 

(Mechler and Rabbira 1981. H^ k ^/ ^"^1^*" 31 any 0ne «™ 
h.gh a, 20-30000 have al,o been Quoted A . ? V ° ' " 0) - a,thou * h "gum a, 
provided evidence ,ugge,tmgrtl? t ^!l ( ° L ,976K Hedri <* « «UlW4) 

. A breakdown oF*^^^^ • 

- -WWthe result, of **2£!5&£Z^ T ^ 
dataobumed P reviou.ly U ,mgotherme^hod!T h " Ve ^ ««P««d with 
«pre,,ed mRNAs are ^presented n ^fil^ TT' that n0t *" diff «'««i.iy 
(which ^portantly. often ^,^J%* 1 * pa " icu,a '« «• — g J 
"^•fferential display sy„em,. ThiH ^ Trn^Jo ,hTn reCOVeted 

— PopuUt, n ,)f rrec »!I=^^ 



. 1998. Kas-Deeien et 
. Rogier etai 1998). 
us issue unimportant, 
ing altered expression 
ion. After all. since all 
pes which intimately 
each cell type could in 
nolecular mechanisms 
growth. It is perhaps 
ments using in vivo as 
ientical cells probably 
nolecular changes that 




a ni$h percentage of 

' -egcstmg that mam- 
secies a; any one time 
f. although rigures as 
Hednck et al. (1984) 
to the rare abundance 
n table 1. 

been compared with 
« not ail differentially 
ocular, rare messages 
not easily recovered 
as the majority of 
population (table 1). 
lates (heterogeneous 
able to detect mRXA 




species present at leu than 1 of th* t^i «dv » 

intermediateor,bu„dam"pe^ n-*ou,v a lenttoan 
urge: only) we , T' / ,nt * re * tm « , > • «*«> simple model systems i., ng | e 
P-ersc^^ Population, the same 

are probablv b««S^r* *T " W °°° * *™»«- These result, 

Produce prod^ d ^D d r ^ 0 T PeaU ° n ** ™> PCR 

The numbers of differentially expressed mR\ a, f(BBn ^ . t , 

up to 1 000 or Zn^Sfi, T*" Pr ° dUCed b> ' * ,ven ™<nmali.n cell. 
Whilst this Ly £Te*«Le fi^: XPre T n f ° ,,0W,n8 Chem,Cal ««**«. 
acavated/upregulaVedln luST/T n M* fal0WB that at lea " 100 

cell, differentially express up "43 HZ 1 h " ,nterf r° n - : - $t,mu, " ed HeL, 
expressed by the cell,) However there £v i"""™* l?™ mRX * 
■nywhere near the recove" of these nu^h r pubhca " ons ^umenring 

normal and r^^Z^nS^tZ ^T^^^i 
total bands to be different Of the.t n. „• ( 3) found on, y /0 f «000 
differentially express Id Ld, ^n , V nXT JT" " ^ " d » 
female rat liver following ethinvl «i """^ 10 u P re ^*d in 

identified 14 diff Z l TollluZ I McK «".e and Drake (1997) 

myristate acetate PNlTI ' " PreSS '° n ~» altered b ? P« rbo 

rnyelomonocy^eU l!^. iES JSC * ^ 

products whose expression was'upreg^teHn the 1 K ft, 10 8 « n 
allege disease sufferers. Linsken, a " f, 995! £ np i f , S b '°° d ,euk ^« * 
expressed between young and senescent fibroM t ^ differ «™»y 
have also provided an aooarTnt o!u^ ^ J Techn,c ' u « "her than DD 

for example. cJ^Wh^S ^Z^* e * P ~ d L " n « *H 

cancer compared to ^^f'^;" "pressed inc lorectal 

? ene, U p re gul,ted ,n ^S^t^ST- " "- ( ' ^ ,? 
clorlbrare: Philips „ a/.„99 0 > m^^^^ZT"^ 9 " 1 ''^' 
mghly metastatic mammary ***acj^^^**** *V "Plated m 
sunc ones. Prashar and Weissman ( 1 9^2 J x to poorly met.- . 

identified approximately 40 Ten« I , fe5fnct,on «*«vsi» and 

act,v, tl0 nof j urk aTT.«lls Groe , ""^ " Pre " ion within 4 « ' 

fragments isolated using SSHo^d , " / I °" 6) 27 t— 

and fou^d only ,2 to Zu^^** ° f Iiv « deration 

tosnow*^^ 

analysis of the ^^^^^^ ^ «® " "e confirmed by 
. -Whilst the latest m^^^^^^^l^ 9 ^ «' 
- and CDfnmental modifications v oZ-^TTV^™* *° ,hdude desi «" 
produceonJy • .null number o f diffeT^K $0me mode,s wi « «enuinelv

*o techmcaJ problem, ^ c» r^cffic "^T" genM Ifs »^«.on the " „e 

baued ,v,tem«. 1„ .ddirion. it „ toou^'"^ the " ^PHfication bv>C R ! 

(Gn^nmkand Uegwater-joo^ -.bb^f^ - ^ reported ^ SSH 

- Ul 1 u l«« n between 



ally- effective — pr ving 
number* of artificial 
e rare messages already 
models will genuinely 
s. In addition, there are 
pie. mRNAs may have 
amplification by PCR- 
circumstances not all 
evclopment. deadenyl- 
i Steitz 1998). whilst 
Hsp70 (and perhaps. 
:avalle etai. 1994). The 
e efficient* of systems 
cy of any system also 
tial display techniques 
-> isolate mRNA that is 
are used to prime first 
ribed to some degree 
It has been shown, at 
can lead to inefficient 
Subtraction kit user 
o likewise in other SH 
:tion amplification step 
me sequences amplify 

the temporal factor. It 
iiy interrogate a cell at 
genes showing altered 
disease processes and 
iscades of signalling. 
ies which are switched 
vital information may 
information about the 
?y can be derived for 
:cuiar interest to the 
- time point analysis is 
men, of course, adds 



ssue of how large the 
gene in question with 
:he isolation of genes 
reported using SSH 
>nstrating a change in 
*ere is a ' grey rone'— - 
of isolation between 



Z£r«X£" ib " D ^° n J* ° ,her hand - » n0t ^ » "»» ,rev 

-fed JSZS^^l^i,"^ that di " erenCM ,n ^ 

" teems highly improbable with current terhn«i„«. .k . 
developed that is able to resolve all „!!I^ C that a gel $ >' $,em cou,d be 
given test svstem fl* °, T^^D^r^T™' express,on «" »" v 

(PAGE, c^^^lZ^to ? ^r l ? ld [ «*' 
u«da.^d,rdmDDe^enme„u.EC,^ ^ his cl ar " f ,989,and - 
products such as those seen in a DD will con AV« , "complex series 01 gene 

-hatappeantobeonebandmagelrnt! infae^ UnrMolvab,e componenu. Thus, 
been well documented (Ma^eu-Da^e r 2 ST !" -I* T"" ,ndeed ' " h " 
band extracted from a DD often reprMentt^col Smith « a l. 1 997) that a „ ng le 
and the me ha. been found fe?S S ^ 

1997). One possible soluTn wa, offered bv \l \ " S*™"* <R ° Ckelt " 
extracted andreamplined^ d idat e h^H f > Ma * ,eu ' Da «°> « (1996). who 
conformation po.vmoS Jssc* nlT ' °° ^ U " d "™ d 
repented me^rulKdSt:^ •«* «"P— * 

high resolution ag^g U s^L^ ~«*. 
HR (National DUgnosi H " ,le L ! rS wh f ^ L ' K) A « uaPor 

than PACE, can onlv separate DV4 T t0 prepare and manipulate 

(15-20 b»e paSTfor ^V^'T^ dUfcr in I,Ze b >' 
products which differ in size by less than this are normally not resolvable. However, a simple technique [...] the resolving power of AGE: the inclusion of [...] HA-yellow (bisbenzamide-PEG [...], Bremen, Germany) in a gel separates identical or closely [...]. HA-red and [...] GC and AT DNA motifs, respectively (Wawer et al. 1995, Hanse Analytik 199[?], personal communication). Since both HA-stains possess [...] when an electric field is applied [...] in opposition to DNA, which is negatively charged and therefore [...]. Thus, if two DNA clones are identical in size (as [...] a standard high resolution agarose gel), but differ in AT/GC [...], inclusion of an HA-dye in the gel will effectively retard the [...] compared to the other, effectively making [...] differentiating between [...]. HA-red has been shown to resolve sequences with [...] 1% [...], whilst [...] was used to distinguish two [...] by only a single point mutation (Hanse Analytik 1996, personal communication).
tgei o). which tepamei identicallv.iiw nv 17 ' tff A) How * v «. the presence of HA J-l

:'""""»"•• *• «oh.»n of PACE ™ ^1^, i d r«'^ P : 0b '' m iM> "' » «"»• 
( I W) «d H«,d, „ , ,„„ D W ' ■"'"""W^.crib.d bv Uln3lnd«, '„ 





I.A.red. Bands of dccremng 
i lubiractivt hybridization 
each cloned band and their 
high resolution 2 ° 0 agarose 
•ed. With tew exceptions, all 
■ er. the presence of HA-red 
he percentage of CC within 
:ies within each band. For 
>c ««e same site, at least four 



rd gel should indicate 
Id separare otherwise 
ase content. Geisinger 
entifying DD-denved 
is laboratory on clones 

he dinerentiaJ displav 
r:ec out in a standard 
ctec and incorporated 

there being different 
\T content. However, 
•again, one might use 
»GGE i or temperature 
he contents of a band, 
or on the reampiified 

jes to visualize large 
3blem in that, in terms 
ands. One approach to 
bed by L'itterhnden er 




Extraction of differentially expressed bands from a gel can be complex since, in some cases (e.g. DD, GEF), the results are visualized by autoradiographic means [...] overlaying the developed film on the gel [...]. Clearly, a misjudged extraction can account for [...] problem, and that is the use of radioisotopes. [...] For example, Lohmann et al. (1995) demonstrated that silver staining can be used directly to visualize DD bands in horizontal PAGs. [...] et al. (1996) avoided the use of radioisotopes by transferring a small amount (20-30%) of the DNA from their DD to a nylon membrane and [...] from the gel. Chen and Peck (1996) went one step further and transferred the entire DD to a nylon membrane. The DNA bands were then visualized using a digoxigenin (DIG) system (DIG was attached to the [...] primers used in the differential display procedure) [...].

One of the advantages of using techniques such as SSH and RDA is that the final display can be run on an agarose gel and the bands visualized with simple [...] staining. [...] with SYBR Green I or SYBR Gold nucleic acid stains (FMC) effectively enhances the intensity and sharpness of the bands. This greatly aids in their precise extraction [...] products that otherwise [...] be overlooked. Whilst [...] stained [...] SYBR Green I is better visualized using short wavelength UV (254 nm) rather than medium wavelength (306 nm), [...] to damage DNA extracted under 254 nm irradiation, effectively preventing [...] with SYBR Green [...] and extract bands under a medium wavelength UV transillumination.

The possible use of 'microfingerprinting' to reduce complexity
band Tak^ ™ T nUmb "° f gene P roduc » ™* Poss.ble complexity of each 
anl,W ir 2PPr0aCh t0 ™ P,d ch « ac ""»«on may be to use an enhanced 
anaJ>s.s of a small secnon o» a different d.splay-a sub-nneerpnnf or 'rrTcro. 

» a oTr^L" 0M C ° Uid C ° nCenWate ° n *~ b -° S S-h oniv r P pe!r 

m a particular cnosen s.xereg,on. Reaucing the nngerpnnt in : nis wav has at lea! 
rwo advantages. One ,s that « should be possible to use d.fferent^l n™, 
concentrates and run t.me, tailored exactlv to that region Current* 22 
run product, from ,00-3000- bp on the same gel. whichSs t 
gel system being used and consequently to suboptimal resolut.on. bo^e™ f 
.ue and number,, and can lead to prob.em, in the accurate excision of indSL tf 
b«ds. Secondly, .t may be possible to enhance resolut.on bv using a 2-D andvsU 

"« , ft ^c^ n • " d T", b : d earHef - ln ,Umma ^' ^ -8-f gene produ« ,L« 
«?7=» 1 mClU . ded Cemia * re,eVant ' « enes ' the 2 " D "andardixed. 

■ ^ulaTeff^ ;J-«fi«t.on of compounds which have s.mi.ar or widely different 
ceHuIar effects If the prognos., for exposure to one or more other chemical, which 
.duplay a ,un.lar. profile is rf-^Jk« ltt - jtal _i lls ^^ ^^ip, ^,«Srj3E 
effect, for any new compounds which show a similar micro-fingerprint 






[...] by selection of PCR primers and/or post-reaction analysis. Stress genes, [...] receptors, cell cycling genes, cytochromes P450 [...] considered as candidates for analysis in this way. [...] DNA arrays (e.g. Clontech's Atlas cDNA Expression Arrays) [...] this to some degree by grouping together genes involved in different responses, e.g. apoptosis, stress, DNA-damage response etc.



Screening 

False positives 



The generation of false Dositiv~ u» u_ 
d^erentuldupiayco^^^^^ ^ '"«* *• 

1994. Sompayrac « al. 1995 S^JL , ,9 ' 3 - N » h '°«^ 1994.Sun« a /. 
technique being used. For insane? in S ? ^ """'^ Var,e * with 
J^HPWim^c»|J^^ *^ f.^P-n which have no, 
hganon event, (O'Neill and Sinclair hS^^^T ^•^^ 

!, C i' mfa ": « d «^tem.te transcription of r«S S lh '° Ugh 

ro be derived largelv from abunda™ „.„ . In SH - fa,,e Pownves app ar 

cDXA/mRNA specie, Jhich do SiSThlh !, gh ^ may ™ 
A quick screenin, of ^ ^ "^nd.zaaon for technical reasons. 

^•'^doidni^^^^^^^^c.nM ut 
from tester and driver mRN'A ire hvbriH " rand probes *>™h«ued 

tester probe, but not driver. iSe * ad van™ " fT"^ d ° neS WiU hybrid « e » 

may not generate detectable **' V ™ S a PP™ch is that rare species 

" to screen the clone. u.in* a u£S k T 6 ° pt,0n for those usin * SSH 

from which it w M derived* and w h I'S^Tf * UbtraCted cDN * 

reaction (ClonTechnique. 1997a) Since the SSH £°? rCVe " e ^ bt »™ « 

« should be pos.ib,e to a^ VS^^^^ 7" 

?ene.. Despite this quick .creenin. ! "preaennng low abundance 

original mRNA J JiS^^Z^S " *° ^ » *" 

approach. .Although this mav be achieved ^Z T\ i 3 ! "° re 

poor by today", high standards an on, ? -Northern blots, the sens.nvirv i, 

sen.itive detenninationM.ee I2 w) " °" meth ° dl for aCCU ™ » d 

Sequence analysis 

the .equence for analvJL of the DNA ^tirKT^ 6 * the ™ * 

confidence in the result-severJ I £nilt , ! * ^ ,eadl 10 • ^"ced 

-fences are-, Wide^^LSSliK 1 membefS who * DNA 
. P450ge„e,up e rfamil y CNeU^«;M996i S J ""L^ ** the 
almost identical to g ene *, rSly ^fr the - d ° ne idemified " 



-nine altered expression 
R primers and /or post- 
* receptors, cell cycling 
onsidered as candidates 
arrays (e.g. Clontechs 
:his to some degree by 
ipoptosis. stress. DNA- 



at length amongst the 
no etal. 1994. Sun etal. 
wives varies with the 
iaptors which have not 
ves through illegitimate 
they can arise through 
I. false positives appear 
i some may arise from 
: for technical reasons, 
ones can be carried out 
ind probes synthesized 
said clones ( Hedrick et 
iones will hybridize to 
•ach is that rare species 
>n for those using 55H 
i :he subtracted cDNA 
he reverse subtraction 
inches rare sequences, 
senting low abundance 
leed :o eo back to the 
a more quantitative 
^iots. :ne sensitivity is 
::hods ror accurate and 



nai products which are 
rabiy reduce the size of 
-inn leads to a reduced 
nembers whose DNA 
e.g. the cytochrome - 
•one identified as being 
• brother gene X, or its 
of a gene was isolated, 
1 4.643 (fETSTjSS L] ^ P em,e P™ 110 ™ *uch a, Wv. 

A further problem associated^* SH techno^ " meCnan " m - 
before SH i, carried out r^eDS"!™!?.?^^ " redund * nc > >» «,«. cases 
*g«uon. Th,s i, imp"™ ^ 

n eSPeCUl,y " " h ' gh 

hybrids and te^SSTdS ? ^ SCqUenCes that «™> c "»- 
Furthermore dSLSSSSLtT * "J* ,Ubtract,on (Ko 19*,,. 

» term, of „vbriSo„3aml.^ m ' *T eDNA ™ y d,ffer —M"** 
or the other (1^!^^^^' ™>' »« do oe 

expressed cDVvf ™.v k 7 , " IOme ln * m *n* >*rom differentiallv 
ced^e H 0 w^ r oL/ f«T" ed '""I 8 SUbtr ' Cme WmdW „ pro'. 
conseouenceofZ,:^ «* • * • 

or more fragment, of different m„ K \h« nm "' *' Vmg ™ C ' m 

as separate band, on ^^^^^^^T^ 
redundancy and inching the number of redundani ^JSZT" 

of ^t degree 

-seq^^^ 

part.cu.arly relevant when Th o^Trd ^d^v aC " PtabIe? ^ ^'^ - 
similar sequences with comDleteK^ *' com P* ri *>«« »iv e 

--stobetoaJlocat^^ ^ o ^ ^ " ° 

.roup those between oO and . ^L'^^SS^^ 

Quantitative analysis 

cJid,r; e nTe^ 

e* P r„,ed 'or in " -»y d iff e«nri.,, y 

ana,y,i,U,popu.arapp ro ach^^^^ « « 

the m, Jor drawback with Northern blot, i, rL7±£tT 0 £Z nt ^ 
to detect rare sequences Sin~,K,~ • , ohen n °t »*n«nve enough 

abundance (see ublen'thr majon ? ° f "pressed in , cell are oflow 

--«hoTo7 ^ RT-PCR may be the 

somewhatmore^;;^^^^^ 
optimization of Son c^dirS 

high thr u^^^^^^^.^«^P^*m V 
P ^ R «y*tem, using muhtchannel pipette,. 96 + . well plates 
[...] relatively involved. One must first of all choose an [...] change in the test cells [...] have been tried in the past, for example interferon-gamma ([...]), [...] (Heuval et al. 1994), [...] (Wong et al. 1994), dihydrofolate reductase (DHFR, [...]), β-2-microglobulin ([...], Murphy et al. 1990), hypoxanthine phosphoribosyl transferase (HPRT, Foss et al. 1998) and a number of [...] (Clontechniques 1997b). Ideally, an internal standard should not change its level of [...] regardless of cell age, stage in the cell cycle [...]. However, it has been shown on numerous occasions that the [...] of most housekeeping genes currently used by the research community do in fact [...] certain conditions and in different tissues (Clontechniques [...]). [...] therefore, that preliminary experiments be carried out [...] housekeeping genes to establish their suitability for use in the model system.
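
The role of the internal standard discussed above can be made concrete with a small calculation. The sketch below is illustrative only and rests on assumptions introduced here rather than taken from the text: the function names are hypothetical, the signals are taken to be positive, linear-scale measurements (for example band intensities), and the 20% tolerance used to flag an unstable standard is an arbitrary example value.

```python
def normalized_fold_change(target_treated, target_control,
                           reference_treated, reference_control):
    """Fold change of a target gene after normalizing to an internal standard.

    Each target signal is divided by the internal-standard signal measured
    in the same sample before the treated/control ratio is taken.
    """
    signals = (target_treated, target_control,
               reference_treated, reference_control)
    if any(value <= 0 for value in signals):
        raise ValueError("signals must be positive")
    treated_ratio = target_treated / reference_treated
    control_ratio = target_control / reference_control
    return treated_ratio / control_ratio


def reference_is_stable(reference_treated, reference_control, tolerance=0.2):
    """Return True if the internal standard changes by no more than
    `tolerance` (expressed as a fraction of the control signal)."""
    change = abs(reference_treated - reference_control) / reference_control
    return change <= tolerance
```

If the supposed housekeeping gene itself doubles on treatment, an apparent 5-fold induction of the target shrinks to 2.5-fold after normalization, which is why the stability check matters before any standard is relied upon.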

Interpretation of quantitative data must also be treated with caution. By comparing the lists of genes identified [...] expression one can perhaps gain insight as to why two different species react in [...] ways to external stimuli. For example, rats and mice appear sensitive to the non-genotoxic effects of a wide range of peroxisome proliferators whilst [...] and guinea pigs are largely resistant (Orton et al. [...], [...] Turnbull 1987, Lake et al. 1989, 1993, Makowska et al. 1992). A simplified [...] the reason(s) why is to compare lists of up- and down-regulated [...] in order to identify those which are expressed in only one species [...] knowledge of the effects of [...] non-genotoxic carcinogenesis or protection. Of course, the situation is likely [...] more complex. Perhaps if there were one key gene protecting guinea pigs [...] non-genotoxic effects and it was upregulated 50 times by [...] up-regulated five times in the rat. However, since [...], the importance of the gene may be overlooked. Just to complicate [...] change in expression does not necessarily mean a biologically [...]. For example, what is the true relevance of gene Y which [...] a particular treatment, and gene Z which shows only a [...]? If one examines the literature one may find that historically gene Y has [...] be up-regulated 40-60-fold by a number of [...] 50-fold increase would appear less significant. However, [...] gene Z has never been recorded as having more than [...], which makes your 5-fold increase all the more exciting. Perhaps even more [...] is if that same 5-fold increase has only been seen in related neoplasms or following treatment with related chemicals.

- Prtbrem, ^-^ e ^-di^ rca ^- ^ y approac - n — 

^^S^^^ of an easily obt^ble 

• -e, pmenta, procL r 



arive analysis is m re 
i interna] standard, the 
rule is often excessive, 
eds of gene species. The 



of an easily obtainable 
\ « tc « animals/cells in 
rirnuli. However, it has 
ZZZSZZZf* P~««- -hil,t .till valid. ,s much too complex 

i^^^::^^ wh,ch prec,ude the 

Furthermore, there are imoortam .11 J I h ° U Cftan * M 10 «P»»»««»- 

which diffe^ex^r^* 10 dliWt -"de-Prnent 

others' TiTSSES ot VuS * ""V™ *™*>™<" «rc,nogen, than 
Polymorphi,™",^^^ ^ """^ Ccu ™' 

or TGGE to the *eVe P k app,,cat,on of «quenc.ng. SSCP. DCCE 

abundance of mRN*7, , !™ « * 5 ' SpeC '" ° f wh " her *" incre " ed 
srabiliry. ° f ,nCreaied «"n»cription or mcreased mRN'A 



Conclusions 

thev^not SSStSZS^I ^ H**™" ^ » th « 

chm.nat.n^dl i^.Kn^V^'' ^ "T'^ *»■ 
analy,., to confirm the result r^?^,^ ' °" 1 No " hem /PCR 
projects mean, that over the next ^0 or so the L?" 0t - 8en ° me maPPing 
carcinogenic effect. Whilst differential display technology
these questions, it does provide a mrm ^Z i , teChn ° ,0 *> cannot hope to answer 
~ and functional studies ' d ™ncat,on. regulat rv 

cellular response, is aCs.t^^f J" " h "!" dW8 ** m ~hani«„ f 

of those genes and SSS^^tSST^ ^ regUlSt, ° n 41,(1 funCt, ° n 
display can be likened to a ,S phito^rTh K " " bWaCt **«™ul 
"me. Consider the Hm^SsSS^iT* ^ ° f * » m 

and condition of the aoo^beL ° U bait,e » nd *« Piemen: 

deduce how the battle prog^ t d w " v ^ " " ked " «* *" d 

Photograph.^ in,^,/^.^,*^ «** a» u did from a few stU , 

must find out the capWlSe, «d moJ I" * e bart,e " *« Hist nan 

officers, what the o^l^ "J £252 "s^"" ^ — — i"t 

terras the remauu of the^. e StS2£T mU " 
conditions exerted. Likewise^ m ^h«£ lhe P" V «»«R weather 

knockout technolog>.. th e analvai, of C S s!^ " ^ ^ teChn,qUe *' ,Uch - 
nme and dose response analyses Al«S*k £ Paihwavs - muM »° n «^y»» and 
importance of differential mmSS'i™. • ^.j*"" reV,ew has emphasized the 
the full impact of thT, "pp^acE S be "trtl ""J""™*™' * -lation J5 
funcuonal genomic, and proceomic (2 d£ * . comb '»™«> 

electrophoresis). Pro^^l^^ ^ ^""P* "*« 
changes resulting in differential mud \ recent """ion a, many of the 

levels, a, decribed *JZS££S£££? ° ^ Chan *« - -*NA 

Protein phosphorylation ^^^ , ^r WU ^T^ P ^ M ^^ 
Proteomictechnologie,forbv^ti^o1; feqU,re fUnCti ° naJ »r 

change, that occur in a edl5u2?^2 LdT T charac "™'* the genetic 
to chemical or biological u*uT JTSS"? f d 7 elo P m «» and in respon, 
prov.de a fingerprm? of e« iage of^v e e ^ent ' WiU 

term should help i„ the elucidation of soedfic and^ "1'°*"' " 1 "* 
types of chermcal/baological exposu^ Z I ease aTes^Th " 
tnerapeunc benefit, of undersomcon* such mole^ Thc poteaoal "edical and 
measurable. Amongst other thmKTuLTn m ° leCUlar Man ?" almost un- 
even, spedfic ^"5^ c^^2l2T» could **«e the family or 
and/or acuteness of that exposure tiJ i " eXP ° Sed to P lus *« length 

They may also help *-J?S£^£^ m °" treatmeT 
diagnostic test, for the c^lJ^ ^Z ^I " 

most efficacious treatment " neoplas.a_an a^sjain. perhap, indicate the 
ABSTRACT The recent ability to sequence whole genomes 
allows ready access to all genetic material. The approaches 
outlined here allow automated analysis of sequence for the 
synthesis of optimal primers in an automated multiplex 
oligonucleotide synthesizer (AMOS). The efficiency is such 
that all ORFs for an organism can be amplified by PCR. The 
resulting amplicons can be used directly in the construction of 
DNA arrays or can be cloned for a large variety of functional 
analyses. These tools allow a replacement of single-gene 
analysis with a highly efficient whole-genome analysis. 
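
To illustrate what automated primer selection for every ORF might involve, the following is a minimal sketch. It is not the authors' algorithm: it assumes primers are taken from the 5' end of each strand and simply extended until a rough melting temperature, estimated with the Wallace rule (2 degrees C per A/T and 4 degrees C per G/C), reaches a target. The function names, the 56 degree target and the 18-30 nt length limits are hypothetical choices for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def pick_primer(template, target_tm=56, min_len=18, max_len=30):
    """Extend a primer from the 5' end of `template` until its estimated
    Tm reaches `target_tm`, or return the longest allowed primer."""
    longest = min(max_len, len(template))
    for length in range(min_len, longest + 1):
        primer = template[:length]
        if wallace_tm(primer) >= target_tm:
            return primer
    return template[:longest]


def orf_primer_pair(orf, **kwargs):
    """Return (forward, reverse) primers flanking a complete ORF."""
    orf = orf.upper()
    forward = pick_primer(orf, **kwargs)
    reverse = pick_primer(reverse_complement(orf), **kwargs)
    return forward, reverse
```

Applied to each predicted ORF in turn, a routine of this kind yields one primer pair per gene; a production design tool would also need to screen for secondary structure and cross-hybridization, which this sketch does not attempt.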



The genome sequencing projects have generated and will 
continue to generate enormous amounts of sequence data. The 
genomes of Saccharomyces cerevisiae, Escherichia coli, Hae- 
mophilus influenzae (1), Mycoplasma genitalium (2), and Meth- 
anococcus jannaschii (3) have been completely sequenced. 
Other model organisms have had substantial portions of their
genomes sequenced as well, including the nematode Caeno- 
rhabditis elegans (4) and the small flowering plant Arabidopsis 
thaliana (5). This massive and increasing amount of sequence 
information allows the development of novel experimental 
approaches to identify gene function. 

One standard use of genome sequence data is to attempt to 
identify the functions of predicted open reading frames 
(ORFs) within the genome by comparison to genes of known 
function. Such a comparative analysis of all ORFs to existing 
sequence data is fast, simple, and requires no experimentation 
and is therefore a reasonable first step. While finding sequence 
homologies/motifs is not a substitute for experimentation, 
noting the presence of sequence homology and/or sequence 
motifs can be a useful first step in finding interesting genes, in 
designing experiments and, in some cases, predicting function. 
However, this type of analysis is frequently uninformative. For
example, over one-half of new ORFs in S. cerevisiae have no 
known function (6). If this is the case in a well studied organism 
such as yeast, the problem will be even worse in organisms that 
are less well studied or less manipulable. A large,
experimentally determined gene function database would make
homology/motif searches much more useful.

Experimental analysis must be performed to thoroughly 
understand the biological function of a gene product. Scaling 
up from classical "cottage industry" one-gene-oriented ap- 
proaches to whole-genome analysis would be very expensive 
and laborious. It is clear that novel strategies are necessary to 
efficiently pursue the next phase of the genome projects — 
whole-genome experimental analysis to explore gene expres- 
sion, gene product function, and other genome functions. 
Model organisms, such as S. cerevisiae, will be extremely
important in the development of novel whole-genome analysis
techniques and, subsequently, in improving our understanding
of other more complex and less manipulable organisms.

The genome sequence can be systematically used as a tool 
to understand ORFs, gene product function, and other ge- 
nome regions. Toward this end, a directed strategy has been 
developed for exploiting sequence information as a means of 
providing information about biological function (Fig. 1). Ef- 
forts have been directed toward the amplification of each 
predicted ORF or any other region of the genome ranging 
from a few base pairs to several kilobase pairs. There are many 
uses for these amplicons — they can be cloned into standard 
vectors or specialized expression vectors, or can be cloned into 
other specialized vectors such as those used for two-hybrid 
analysis. The amplicons can also be used directly by, for 
example, arraying onto glass for expression analysis, for DNA 
binding assays, or for any direct DNA assay (7). As a pilot 
study, synthetic primers were made on the 96-well automated 
multiplex oligonucleotide synthesizer (AMOS) instrument (8) 
(Fig. 2). These oligonucleotides were used to amplify each 
ORF on yeast chromosome V. The current version of this 
instrument can synthesize three plates of 96 oligonucleotides 
each (25 bases) in an 8-hr day. The amplification of the entire 
set of PCR products was then analyzed by gel electrophoresis 
(Fig. 3). Successful amplification of the proper length product 
on the first attempt was 95%. This project demonstrates that 
one can go directly from sequence information to biological 
analysis in a truly automated, totally directed manner. 
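
As a rough illustration of what this synthesis rate implies, the arithmetic can be sketched in Python. This is only a back-of-the-envelope estimate, not part of the reported method: it assumes two primers (forward and reverse) per ORF, a single instrument, and the figure of about 6,000 yeast ORFs given in the Fig. 2 caption. The function name is hypothetical.

```python
import math

def synthesis_days(n_orfs, primers_per_orf=2, plates_per_day=3, wells_per_plate=96):
    """Working days needed to synthesize every primer on one instrument."""
    total_oligos = n_orfs * primers_per_orf
    oligos_per_day = plates_per_day * wells_per_plate  # 3 plates x 96 wells = 288
    return math.ceil(total_oligos / oligos_per_day)

# Test system: 240 ORFs from chromosome V
print(synthesis_days(240))   # 2
# Whole genome: about 6,000 ORFs (assumed from the Fig. 2 caption)
print(synthesis_days(6000))  # 42
```

Under these assumptions the chromosome V test set needs two working days of synthesis, and a whole-genome primer set needs 42 working days on one instrument.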

These amplicons can be incorporated directly in arrays or 
the amplicons can be cloned. If the amplicons are to be cloned, 
novel sequences can be incorporated at the 5' end of the 
oligonucleotide to facilitate cloning. One potential problem 
with cloning PCR products is that the cloned amplicons may 
contain sequence alterations that diminish their utility. One 
option would be to resequence each individual amplicon. 
However, this is expensive, inefficient, and time consuming. A 
faster, more cost-effective, and more accurate approach is to 
apply comparative sequencing by denaturing HPLC (9). This 
method is capable of detecting a single base change in a 2-kb 
heteroduplex. Longer amplicons can be analyzed by use of 
appropriate restriction fragments. If any change is detected in 
a clone, an alternate clone of the same region can be analyzed. 
Modifying the system to allow high throughput analysis by 
denaturing HPLC is also relatively simple and straightforward. 

Fig. 1. Overview of systematic method for isolating individual
genes. Sequence information is obtained automatically from sequence
databases. The data are input into primer selection software
specifically designed to target ORFs as designated by database
annotations. The output file containing the primer information is
directly read by a high-throughput oligonucleotide synthesizer,
which makes the oligonucleotides in 96-well plates (AMOS, automated
multiplex oligonucleotide synthesizer). The forward and reverse
primers are synthesized in the same location on separate plates to
facilitate the downstream handling of primers. The amplicons are
generated by PCR in 96-well plates as well.

If amplicons are used directly on arrays without cloning, it
is important to note that, even if single PCR product bands are
observed on gels, the PCR products will be contaminated with
various amounts of other sequences. This contamination has
the potential to affect the results in, for example, expression
analysis. On the other hand, direct use of the amplicons is
much less labor intensive and greatly decreases the occurrence
of mistakes in clone identification, a ubiquitous problem
associated with large clone set archiving and retrieving.

Any large-scale effort to capture each ORF within a genome 
must rely on automation if cost is to be minimized while 
efficiency is maximized. Toward that end, primers targeting 
ORFs were designed automatically using simple new scripts 
and existing primer selection software. These script-selected 
primer sequences were directly read by the high-throughput 
synthesizer and the forward and reverse primers were synthe- 
sized in separate plates in corresponding wells to facilitate 
automated pipetting and PCR amplifications. Each of the 
resulting PCR products, generated with minimum labor, con- 
tains a known, unique ORF. 

Large-scale genome analysis projects are dependent on
newly emerging technologies to make the studies practical and 
economically feasible. For example, the cost of the primers, a 
significant issue in the past, has been reduced dramatically to 
make feasible this and other projects that require tens of 
thousands of oligonucleotides. Other methods of high- 
throughput analysis are also vital to the success of functional 
analysis projects, such as microarraying and oligonucleotide 
chip methods (10-14). 

Fig. 2. Overall approach for using database of a genome to direct
biological analysis. The synthesis of the 6,000 ORFs (orfs) for each
gene of S. cerevisiae can be used in many applications utilizing both
cloning and microarraying technology.

Changes in attitude are also required. One of the major costs
of commercial oligonucleotides is extensive quality control
such that virtually 100% of the supplied oligonucleotides are
successfully synthesized and work for their intended purpose.
Considerable cost reduction can be obtained by simply
decreasing the expected successful synthesis rate to 95-97%. One
can then achieve faster and cheaper whole genome coverage by
simply adding a single quality control at the end of the
experiment and batching the failures for resynthesis.
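
The end-of-experiment quality control described above can be sketched as a simple batching step. This is a minimal Python illustration, not the authors' software: the pass/fail record, the oligonucleotide names, and the function name are all hypothetical, and a 96-well resynthesis plate is assumed.

```python
def batch_failures(results, wells_per_plate=96):
    """Group failed oligonucleotide IDs into plates for resynthesis.

    results maps an oligo ID to True (product of the expected size was
    seen at the final quality-control step) or False (it was not).
    """
    failed = sorted(oligo for oligo, ok in results.items() if not ok)
    return [failed[i:i + wells_per_plate]
            for i in range(0, len(failed), wells_per_plate)]

# Hypothetical run: 1,000 oligos, every 25th one fails (96% success)
results = {f"oligo_{i:04d}": (i % 25 != 0) for i in range(1000)}
plates = batch_failures(results)
print(len(plates), len(plates[0]))  # 1 40
```

In this made-up run the 40 failures fit on a single resynthesis plate, which is the point of deferring quality control to the end and resynthesizing in one batch.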
The directed nature of the amplicon approach is of clear
advantage. The sequence of each ORF is analyzed automati- 
cally, and unique specific primers are made to target each 
ORF. Thus, there is relatively little time or labor involved — for 
example, no random cloning and subsequent screening is 
required because each product is known. In the test system, 
primers for 240 ORFs from chromosome V were systematically 
synthesized, beginning from the left arm and continuing 
through to the right arm. At no point was there any manual 
analysis of sequence information to generate the collection. In 
many ways, now that the sequence is known, there is no need 
for the researcher to examine it. 

These amplicons can be arrayed and expression analysis can 
be done on all arrayed ORFs with a single hybridization (10). 
Those ORFs that display significant differential expression 
patterns under a given selection are easily identified without 
the laborious task of searching for and then sequencing a clone. 
Once scaled up, the procedure provides even greater returns 
on effort, because a single hybridization will ultimately provide 
a "snapshot" of the expression of all genes in the yeast genome. 
Thus, the limiting factor in whole genome analysis will not be 
the analysis process itself, but will instead be the ability of 
researchers to design and carry out experimental selections. 

Fig. 3. Gel image of amplifications. Using the method described in
Fig. 1, amplicons were generated for ORFs of S. cerevisiae chromosome
V. One plate of 96 amplification reactions is shown.

Current expression and genetic analysis technologies are
geared toward the analysis of single genes and are ill suited to
analyze numerous genes under many conditions. Additional
difficulties with current technologies include: the effort and
expense required to analyze expression and make mutants, the
potential duplication of effort if done by different laboratories,
and the possibility of conflicting results obtained from
different laboratories. In contrast, whole genome analysis not only
is more efficient, it also provides data of much higher quality;
all genes are assayed and compared in parallel under exactly
the same conditions. In addition, amplicons have many
applications beyond gene expression. For example, one recent
approach is to incorporate a unique DNA sequence tag,
synthesized as part of each gene specific primer, during
amplification. The tags or molecular bar codes, when
reintroduced into the organism as a gene deletion or as a gene clone,
can be used much more efficiently than individual mutations
or clones because pools of tagged mutants or transformants
can be analyzed in parallel. This parallel analysis is possible
because the tags are readily and quantitatively amplified even
in complex mixtures of tags (13).
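
One way to picture the parallel analysis of a tagged pool is to compare each tag's share of the pool before and after a selection. The Python sketch below is illustrative only: the tag names and signal values are invented, the pseudocount is an assumed safeguard against division by zero, and the text does not specify how tag abundance is actually scored.

```python
import math

def tag_log_ratios(before, after, pseudocount=1.0):
    """log2 change in each tag's share of the pool between two samples."""
    n_tags = len(before)
    total_before = sum(before.values()) + pseudocount * n_tags
    total_after = sum(after.get(tag, 0.0) for tag in before) + pseudocount * n_tags
    ratios = {}
    for tag, signal in before.items():
        share_before = (signal + pseudocount) / total_before
        share_after = (after.get(tag, 0.0) + pseudocount) / total_after
        ratios[tag] = math.log2(share_after / share_before)
    return ratios

# Hypothetical tag signals for a pool of three tagged strains
before = {"tag_A": 100.0, "tag_B": 100.0, "tag_C": 100.0}
after = {"tag_A": 100.0, "tag_B": 100.0, "tag_C": 10.0}
ratios = tag_log_ratios(before, after)
depleted = [tag for tag, value in ratios.items() if value < 0]
print(depleted)  # ['tag_C']
```

A strain whose tag loses share under selection shows a negative log ratio, so every strain in the pool is scored from one measurement rather than one experiment per strain.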

These ORF genome arrays and oligonucleotide tagged 
libraries can be used for many applications. Any conventional 
selection applied to a library that gives discrete or multiple 
products can use these technologies for a simple direct readout.
These include screens and selections for mutant complementation,
overexpression suppression (15, 16), second-site
suppressors, synthetic lethality, drug target overexpression 
(17), two-hybrid screens (18), genome mismatch scanning (19), 
or recombination mapping. 

The genome projects have provided researchers with a vast 
amount of information. These data must be used efficiently 
and systematically to gain a truly comprehensive understand- 
ing of gene function and, more broadly, of the entire genome 
which can then be applied to other organisms. Such global 
approaches are essential if we are to gain an understanding of 
the living cell. This understanding should come from the 
viewpoint of the integration of complex regulatory networks, 
the individual roles and interactions of thousands of functional 
gene products, and the effect of environmental changes on 
both gene regulatory networks and the roles of all gene 
products. The time has come to switch from the analysis of a 
single gene to the analysis of the whole genome. 

Support was provided by National Institutes of Health Grants 
R37H60198 and P01H600205. 
INTRODUCTION

Technological advancements combined with intensive DNA
sequencing efforts have generated an enormous database of
sequence information over the past decade. To date, more
than 3 million sequences, totaling over 2.2 billion bases [1],
are contained within the GenBank database, which includes
the complete sequences of 19 different organisms [2]. The
first complete sequence of a free-living organism,
Haemophilus influenzae, was reported in 1995 [3] and was
followed shortly thereafter by the first complete sequence
of a eukaryote, Saccharomyces cerevisiae [4]. The
development of dramatically improved sequencing
methodologies promises that complete elucidation of the
Homo sapiens DNA sequence is not far behind [5].

To exploit more fully the wealth of new sequence
information, it was necessary to develop novel methods for
the high-throughput or parallel monitoring of gene
expression. Established methods such as northern blotting,
RNase protection assays, S1 nuclease analysis, plaque
hybridization, and slot blots do not provide sufficient
throughput to effectively utilize the new genomics
resources. Newer methods such as differential display [6],
high-density filter hybridization [7,8], serial analysis of
gene expression [9], and cDNA- and oligonucleotide-based
microarray "chip" hybridization [10-12] are possible
solutions to this bottleneck. It is our belief that the
microarray approach, which allows the monitoring of
expression levels of thousands of genes simultaneously, is
a tool of unprecedented power for use in toxicology
studies.



Almost without exception, gene expression is altered
during toxicity, as either a direct or indirect result of
toxicant exposure. The challenge facing toxicologists is to
define, under a given set of experimental conditions, the
characteristic and specific pattern of gene expression
elicited by a given toxicant. Microarray technology offers
an ideal platform for this type of analysis and could be the
foundation for a fundamentally new approach to toxicology
testing.

MICROARRAY DEVELOPMENT AND APPLICATIONS 
cDNA Microarrays 

In the past several years, numerous systems were
developed for the construction of large-scale DNA arrays.
All of these platforms are based on cDNAs or
oligonucleotides immobilized to a solid support. In the
cDNA approach, cDNA (or genomic) clones of interest are
arrayed in a multi-well format and amplified by polymerase
chain reaction. The products of this amplification, which
are usually 500- to 2000-bp clones from the 3' regions of
the genes of interest, are then spotted onto solid support
by using high-speed robotics. By using this method,
microarrays of up to 10 000 clones can be generated by
spotting onto a glass substrate [13,14]. Sample detection
for microarrays on glass involves the use of probes labeled
with fluorescent or radioactive nucleotides.

Fluorescent cDNA probes are generated from control and
test RNA samples in single-round reverse-transcription
reactions in the presence of fluorescently tagged dUTP
(e.g., Cy3-dUTP and Cy5-dUTP), which produces control and
test products labeled with different fluors. The cDNAs
generated from these two populations, collectively termed
the "probe," are then mixed and hybridized to the array
under a glass coverslip [10,11,15]. The fluorescent signal
is detected by using a custom-designed scanning confocal
microscope equipped with a motorized stage and lasers for
fluor excitation [10,11,15]. The data are analyzed with
custom digital image analysis software that determines for
each DNA feature the ratio of fluor 1 to fluor 2, corrected
for local background [16,17]. The strength of this approach
lies in the ability to label RNAs from control and treated
samples with different fluorescent nucleotides, allowing
for the simultaneous hybridization and detection of both
populations on one microarray. This method eliminates the
need to control for hybridization between arrays. The
research groups of Drs. Patrick Brown and Ron Davis at
Stanford University spearheaded the effort to develop this
approach, which has been successfully applied to studies of
Arabidopsis thaliana RNA [10], yeast genomic DNA [15],
tumorigenic versus non-tumorigenic human tumor cell lines
[11], human T-cells [18], yeast RNA [19], and human
inflammatory disease-related genes [20]. The most dramatic
result of this effort was the first published account of
gene expression of an entire genome, that of the yeast
Saccharomyces cerevisiae [21].
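
The per-feature ratio of fluor 1 to fluor 2, corrected for local background, can be illustrated with a short Python sketch. It is not the custom image-analysis software cited above: the intensity values are invented, and the floor applied to background-subtracted signals is an assumption added here so that the ratio stays defined for very faint features.

```python
def background_corrected_ratio(fluor1, background1, fluor2, background2, floor=1.0):
    """Ratio of fluor 1 to fluor 2 for one array feature.

    Each channel has its local background subtracted first. The floor
    (an assumption of this sketch) avoids zero or negative signals.
    """
    signal1 = max(fluor1 - background1, floor)
    signal2 = max(fluor2 - background2, floor)
    return signal1 / signal2

# Hypothetical readings: (fluor 1, background 1, fluor 2, background 2)
features = {
    "feature_1": (5200.0, 200.0, 1450.0, 200.0),
    "feature_2": (1200.0, 200.0, 1200.0, 200.0),
    "feature_3": (700.0, 200.0, 2200.0, 200.0),
}
ratios = {name: background_corrected_ratio(*values)
          for name, values in features.items()}
print(ratios)  # {'feature_1': 4.0, 'feature_2': 1.0, 'feature_3': 0.25}
```

Because both samples are hybridized to the same feature, each ratio is formed within one array, which is why no between-array control is needed.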

In an alternative approach, large numbers of cDNA clones
can be spotted onto a membrane support, albeit at a lower
density [7,22]. This method is useful for expression
profiling and large-scale screening and mapping of genomic
or cDNA clones [7,22-24]. In expression profiling on filter
membranes, two different membranes are used simultaneously
for control and test RNA hybridizations, or a single
membrane is stripped and reprobed. The signal is detected
by using radioactive nucleotides and visualized by
phosphorimager analysis or autoradiography. Numerous
companies now sell such cDNA membranes and software to
analyze the image data.

Oligonucleotide Microarrays 

Oligonucleotide microarrays are constructed either by
spotting prefabricated oligos on a glass support [13] or by
the more elegant method of direct in situ oligo synthesis
on the glass surface by photolithography [28-30]. The
strength of this approach lies in its ability to
discriminate DNA molecules based on single base-pair
difference. This allows the application of this method to
the fields of medical diagnostics, pharmacogenetics, and
sequencing by hybridization as well as gene-expression
analysis.

Fabrication of oligonucleotide chips by photolithography
is theoretically simple but technically complex [29,30].
The light from a high-intensity mercury lamp is directed
through a photolithographic mask onto the silica surface,
resulting in deprotection of the terminal nucleotides in
the illuminated regions. The entire chip is then reacted
with [illegible] elongation. This process requires only 4n
cycles (where n = oligonucleotide length in bases)
[illegible] number of which is limited only by the
complexity of the photolithographic mask and the chip size
[29, illegible].
Sample preparation involves the generation of
double-stranded cDNA from cellular poly(A)+ RNA followed by
antisense RNA synthesis in an in vitro transcription
reaction with biotinylated or fluor-tagged nucleotides. The
RNA probe is then fragmented to facilitate hybridization.
If the indirect visualization method is used, the chips are
incubated with fluor-linked streptavidin (e.g.,
phycoerythrin) after hybridization [12,33]. The signal is
detected with a custom confocal scanner [34]. This method
has been applied successfully to the mapping of genomic
library clones [35], to de novo sequencing by hybridization
[28,36], and to evolutionary sequence comparison of the
BRCA1 gene [37]. In addition, mutations in the cystic
fibrosis [38] and BRCA1 [39] gene products and
polymorphisms in the human immunodeficiency virus-1 clade B
protease gene [40] have been detected by this method.
Oligonucleotide chips are also useful for expression
monitoring [33], as has been demonstrated by the
simultaneous evaluation of gene-expression patterns in
nearly all open reading frames of the yeast strain
S. cerevisiae [12]. More recently, oligonucleotide chips
have been used to help identify single nucleotide
polymorphisms in the human [41] and yeast [42] genomes.
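
The 4n cycle count given above for photolithographic synthesis can be set beside the number of distinct sequences of length n, which is 4 to the power n for a four-base alphabet. The Python sketch below only works through that arithmetic; the function names are hypothetical, and the comparison itself is an illustration added here rather than a figure taken from the text.

```python
def synthesis_cycles(n):
    """Coupling cycles for length-n oligos: one cycle per base (A, C, G, T)
    at each of the n positions."""
    return 4 * n

def sequence_space(n):
    """Number of distinct length-n sequences over four bases."""
    return 4 ** n

for n in (10, 20, 25):
    print(n, synthesis_cycles(n), sequence_space(n))
# 10 40 1048576
# 20 80 1099511627776
# 25 100 1125899906842624
```

The cycle count grows linearly with oligonucleotide length while the space of possible sequences grows exponentially, so the practical limit on array content is the mask and chip size rather than the chemistry.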

THE USE OF MICROARRAYS IN TOXICOLOGY 
Screening for Mechanism of Action 

The field of toxicology uses numerous in vivo model
systems, including the rat, mouse, and rabbit, to assess
potential toxicity, and these bioassays are the mainstay of
toxicology testing. However, in the past several decades, a
plethora of in vitro techniques have been developed to
measure toxicity, many of which measure toxicant-induced
DNA damage. Examples of these assays include the Ames test,
the Syrian hamster embryo cell transformation assay,
micronucleus assays, measurements of sister chromatid
exchange and unscheduled DNA synthesis, and many others.
Fundamental to all of these methods is the fact that
toxicity is often preceded by, and results in, alterations
in gene expression. In many cases, these changes in gene
expression are a far more sensitive, characteristic, and
measurable endpoint than the toxicity itself. We therefore
propose that a method based on measurements of the
genome-wide gene expression pattern of an organism after
toxicant exposure is fundamentally informative and
complements the established methods described above.

We are developing a method by which toxicants can be
identified and their putative mechanisms of action
determined by using toxicant-induced gene expression
profiles. In this method, in one or more defined model
systems, dose and time-course parameters are established
for a series of toxicants within a given prototypic class
(e.g., polycyclic aromatic hydrocarbons (PAHs)). Cells are
then treated with these agents at a fixed toxicity level
(as measured by cell survival), RNA is harvested, and
toxicant-induced gene expression changes are assessed by
hybridization to a cDNA microarray chip (Figure 1). We have
developed a custom DNA chip, called ToxChip v1.0,
specifically for this purpose and will discuss it in more
detail below. The changes in gene expression induced by the
test agents in the model systems are analyzed, and the
common set of changes unique to that class of toxicants,
termed a toxicant signature, is determined. This signature
is derived by ranking across all experiments the
gene-expression data based on relative fold induction or
suppression of genes in treated samples versus untreated
controls and selecting the most consistently different
signals across the sample set. A different signature may be
established for each prototypic toxicant class. Once the
signatures are determined, gene-expression profiles induced
by unknown agents in these same model systems can then be
compared with the established signatures. A match assigns a
putative mechanism of action to the test compound. Figure 2
illustrates this signature method for different types of
oxidant stressors, PAHs, and peroxisome proliferators. In
this example, the unknown compound in question had a
gene-expression profile similar to that of the oxidant
stressors in the database. We anticipate that this general
method will also reveal cross talk between different
pathways induced by a single agent (e.g., reveal that a
compound has both PAH-like and oxidant-like properties). In
the future, it may be necessary to distinguish very subtle
differences between compounds within a very large sample
set (e.g., thousands of highly similar structural isomers
in a combinatorial chemistry library or peptide library).
To generate these highly refined signatures, standard
statistical clustering techniques or principal-component
analysis can be used.
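
The signature approach described above (rank genes by how consistently they change within a toxicant class, then compare an unknown profile with each stored signature) can be sketched in a few lines of Python. Everything specific in the sketch is an assumption made for illustration: the gene names and log ratios are invented, the consistency score is one simple choice among many, and Pearson correlation stands in for whatever comparison the authors actually use.

```python
import math

def derive_signature(experiments, n_genes):
    """Keep the n_genes that change most consistently across experiments.

    experiments: list of dicts mapping gene -> log2(treated / control).
    A gene scores its smallest absolute change when all experiments agree
    on direction, and zero otherwise (an assumed scoring rule).
    """
    genes = set.intersection(*(set(e) for e in experiments))
    scored = []
    for gene in genes:
        values = [e[gene] for e in experiments]
        same_direction = all(v > 0 for v in values) or all(v < 0 for v in values)
        score = min(abs(v) for v in values) if same_direction else 0.0
        scored.append((score, gene, sum(values) / len(values)))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return {gene: mean for _, gene, mean in scored[:n_genes]}

def match_score(profile, signature):
    """Pearson correlation between a profile and a signature."""
    genes = sorted(g for g in signature if g in profile)
    x = [profile[g] for g in genes]
    y = [signature[g] for g in genes]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

# Invented log2 ratios for two toxicant classes, two experiments each
training = {
    "oxidant stressor": [
        {"gene_a": 3.0, "gene_b": 2.0, "gene_c": 0.1, "gene_d": -1.5},
        {"gene_a": 2.5, "gene_b": 1.8, "gene_c": -0.2, "gene_d": -1.2},
    ],
    "PAH": [
        {"gene_a": 0.2, "gene_b": 0.1, "gene_c": 4.0, "gene_d": 0.3},
        {"gene_a": -0.1, "gene_b": 0.3, "gene_c": 3.5, "gene_d": 0.2},
    ],
}
signatures = {name: derive_signature(exps, n_genes=3)
              for name, exps in training.items()}

unknown = {"gene_a": 2.8, "gene_b": 1.5, "gene_c": 0.3, "gene_d": -1.0}
scores = {name: match_score(unknown, sig) for name, sig in signatures.items()}
best = max(scores, key=scores.get)
print(best)  # oxidant stressor
```

With these invented numbers the unknown profile correlates strongly with the oxidant-stressor signature and hardly at all with the PAH signature, mirroring the kind of assignment the text describes for Figure 2. A real analysis would involve far more genes and, as the text notes, clustering or principal-component analysis for finer distinctions.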

For the studies outlined in Figure 2, we developed
the custom cDNA microarray chip ToxChip v1.0.

Figure 1. Simplified overview of the method for sample
preparation and hybridization to cDNA microarrays. For
illustrative purposes, samples derived from cell culture
are depicted, although other sample types are amenable to
this analysis. [Diagram labels: Control Population; Treated
Population; Cy3; Cy5; RNA Isolation; Reverse Transcription;
Mix cDNAs and Apply to Array; DNA "Chip"; Hybridize Under
Coverslip.]
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cturacttrotK to tMt typ of *^t££5\Sl?~*~ 

The 2090 human genes that comprise this subarray were
selected for their well-documented involvement in basic
cellular processes as well as their responses to different
types of toxic insult. Included on this list are DNA
replication and repair genes, apoptosis genes, and genes
responsive to PAHs and dioxin-like compounds, peroxisome
proliferators, [illegible] compounds, and oxidant stress.
Some of the other categories of genes include transcription
factors, oncogenes, tumor suppressor genes, cyclins,
[illegible] cell adhesion [illegible]. [illegible] group
are 84 housekeeping genes, whose hybridization intensity is
averaged and used for signal normalization of the other
genes on the chip. To date, [illegible] toxicant has been
shown to have [illegible] these housekeeping genes.
[illegible] this housekeeping list will be [illegible] new
data warrant the addition or deletion of a particular gene.
Table 1 contains a general description of some of the
different classes of genes that comprise ToxChip v1.0.
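
The use of averaged housekeeping-gene intensity for signal normalization can be illustrated as follows. This is a minimal Python sketch under stated assumptions: the text says only that the housekeeping intensities are averaged and used for normalization, so dividing each signal by that average is an assumed reading, and the gene names and intensities are invented.

```python
def normalize_to_housekeeping(intensities, housekeeping):
    """Divide every signal by the mean intensity of the housekeeping genes."""
    reference = [intensities[g] for g in housekeeping if g in intensities]
    if not reference:
        raise ValueError("no housekeeping genes found on this array")
    reference_mean = sum(reference) / len(reference)
    return {gene: value / reference_mean for gene, value in intensities.items()}

# Hypothetical hybridization intensities from one array
intensities = {"hk_1": 1000.0, "hk_2": 3000.0, "gene_x": 4000.0, "gene_y": 500.0}
normalized = normalize_to_housekeeping(intensities, ["hk_1", "hk_2"])
print(normalized["gene_x"], normalized["gene_y"])  # 2.0 0.25
```

Expressing each signal relative to the housekeeping average puts arrays with different overall brightness on a common scale, which is only valid while the housekeeping genes themselves are unaffected by the treatment.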

«™ ZLTi Cint " gnatUre iJ d ««n,ined. the 

s«een« uncna ™«^ toxicants are then 

S1 f S; n Cin * qUlCUy refom »««l » that 
blocks f genes representing the different signatures 



th« unknown .gent m «* h * n '* m of ■««>" « asugned to 

are displayed (U|. This facilitates rapid, visual in- 
Ch p *2.0 and ch.ps tor other model svstrni 

Animal Models in Toxicology Testing 

The toxicology community relies heavily on the use of
animals as model systems for toxicology testing.
[illegible] these are inherently expensive, require large
numbers of animals, and take a [illegible] to complete and
analyze. Therefore, the National Institute of Environmental
Health Sciences (NIEHS), the National Toxicology Program,
and the toxicology community at large are committed to
reducing the number of animals used, by developing
[illegible]. [illegible] substantial progress has been made
in the development of alternative methods, bioassays are
still used for testing endpoints such as neurotoxicity,
[illegible] toxicology, and genetic toxicology. The rodent
cancer bioassay is a particularly expensive and
time-consuming assay, as it requires almost 4 yr, 1200
[illegible] analyze [43]. In vitro experiments of the type
outlined in Figure 2 might provide evidence that an unknown
agent is (or is not) responsible for eliciting a given
biological response. This information would help to select
a bioassay more specifically suited to the agent in
question or perhaps suggest that a bioassay is not
necessary, which would dramatically reduce cost, animal
use, and time.

Table 1. ToxChip v1.0: A Human cDNA Microarray Chip
Designed to Detect Responses to Toxic Insult

Gene category*                           No. of genes
Apoptosis                                [illegible]
DNA replication and repair               99
Oxidative stress/redox homeostasis       90
Peroxisome proliferator responsive       22
Dioxin/PAH responsive                    12
Estrogen responsive                      63
Housekeeping                             84
Oncogenes and tumor suppressor genes     76
Cell-cycle control                       51
Transcription factors                    131
Kinases                                  276
Phosphatases                             68
Heat-shock proteins                      23
Receptors                                349
Cytochrome P450s                         30

*This list is intended as a general guide. The gene
categories are not unique, and some genes are listed in
multiple categories.

The addition of microarray techniques to standard
bioassays may dramatically enhance the sensitivity and
interpretability of the bioassay and possibly reduce its
cost. Gene-expression signatures could be determined for
various types of tissue-specific toxicants, and new
compounds could be screened for these characteristic
signatures, providing a rapid and sensitive in vivo test.
Also, because gene expression is often exquisitely
sensitive to low doses of a toxicant, the combination of
gene-expression screening and the bioassay might allow the
use of lower toxicant doses, which are more relevant to
human exposure levels, and the use of fewer animals. In
addition, gene-expression changes are normally measured in
hours or days, not in the months to years required for
tumor development. Furthermore, microarrays might be
particularly useful for investigating the relationship
between acute and chronic toxicity and identifying
secondary effects of a given toxicant by studying the
relationship between the duration of exposure to a toxicant
and the gene-expression profile produced. Thus, a bioassay
that incorporates gene-expression signatures with
traditional endpoints might be substantially shorter, use
more realistic dose regimens, and cost substantially less
than the current assays do.

These considerations are also relevant for branches of
toxicology not related to human health and not using
rodents as model systems, such as aquatic toxicology and
plant pathology. Bioassays based on the fathead minnow,
Daphnia, and Arabidopsis could also be improved by the
addition of microarray analysis. The combination of
microarrays with traditional bioassays might also be useful
for investigating some of the more intractable problems in
toxicology research, such as the effects of complex
mixtures and the difficulties in cross-species
extrapolation.

Exposure Assessment, Environmental Monitoring,
and Drug Safety

The currently used methods for assessment of exposure to
chemical toxicants are based on measurement of tissue toxin
levels or on surrogate markers of toxicity, termed
biomarkers (e.g., peripheral blood levels of hepatic
enzymes or DNA adducts). Because gene expression is a
sensitive endpoint, gene expression as measured with
microarray technology may be useful as a new biomarker to
more precisely identify hazards and to assess exposure.
Similarly, microarrays could be used in an
environmental-monitoring capacity to measure the effect of
potential contaminants on the gene-expression profiles of
resident organisms. In an analogous fashion, microarrays
could be used to measure gene-expression endpoints in
subjects in clinical trials. The combination of these
gene-expression data and more established toxic endpoints
in these trials could be used to define highly precise
surrogates of safety.

Gene-expression profiles in samples from exposed
individuals could be compared to the profiles of the
same individuals before exposure. From this information,
the nature of the toxic exposure can be determined
or a relative clinical safety factor estimated.
In the future it may also be possible to estimate not
only the nature but the dose of the toxicant for a
given exposure, based on relative gene-expression
levels. This general approach may be particularly
appropriate for occupational-health applications, in
which unexposed and exposed samples from the
same individuals may be obtainable. For example,
a pilot study of gene expression in peripheral-blood
lymphocytes of Polish coke-oven workers exposed
to PAHs (and many other compounds) is under
consideration at the NIEHS. An important consideration
for these types of studies is that gene expression can
be affected by numerous factors, including diet,
health, and personal habits. To reduce the effects
of these confounding factors, it may be necessary
to compare pools of control samples with pools of
treated samples. In the future it may be possible to
compare exposed sample sets to a national database
of human-expression data, thus eliminating the
need to provide an unexposed sample from the same
individual. Efforts to develop such a national
gene-expression database are currently under way (44,45).
However, this national database approach will require
a better understanding of genome-wide gene
expression across the highly diverse human population
and of the effects of environmental factors
on this expression.
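
The pooled comparison described above can be illustrated with a
minimal sketch. It assumes each sample is a mapping from a gene name
to a positive expression measurement in arbitrary units; the function
name, the gene names, and the numbers are hypothetical and are not
taken from the article.

```python
import math
from statistics import mean


def pooled_log2_ratios(control_pool, exposed_pool):
    """Return log2(mean exposed / mean control) for genes present in every sample.

    Each pool is a list of dicts mapping a gene name to a positive
    expression measurement (arbitrary units).
    """
    samples = control_pool + exposed_pool
    shared_genes = set(samples[0])
    for sample in samples[1:]:
        shared_genes &= set(sample)

    ratios = {}
    for gene in sorted(shared_genes):
        control_mean = mean(s[gene] for s in control_pool)
        exposed_mean = mean(s[gene] for s in exposed_pool)
        ratios[gene] = math.log2(exposed_mean / control_mean)
    return ratios


# Hypothetical measurements, for illustration only.
controls = [{"gene_a": 100.0, "gene_b": 500.0}, {"gene_a": 140.0, "gene_b": 500.0}]
exposed = [{"gene_a": 450.0, "gene_b": 480.0}, {"gene_a": 510.0, "gene_b": 520.0}]
pooled = pooled_log2_ratios(controls, exposed)
```

Averaging within each pool before taking the ratio is one simple way
to damp individual variation from diet, health, and personal habits;
it is a design choice of this sketch, not a method stated in the text.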
Abstract

Recent progress in genomics and proteomics technologies has created a unique opportunity to significantly impact 
the pharmaceutical drug development processes. The perception that cells and whole organisms express specific 
inducible responses to stimuli such as drug treatment implies that unique expression patterns, molecular fingerprints, 
indicative of a drug's efficacy and potential toxicity are accessible. The integration into state-of-the-art toxicology of 
assays allowing one to profile treatment-related changes in gene expression patterns promises new insights into 
mechanisms of drug action and toxicity. The benefits will be improved lead selection, and optimized monitoring of 
drug efficacy and safety in pre-clinical and clinical studies based on biologically relevant tissue and surrogate markers.
© 2000 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Proteomics; Genomics; Toxicology
1. Introduction

The majority of drugs act by binding to protein 
targets, most to known proteins representing en- 
zymes, receptors and channels, resulting in effects 
such as enzyme inhibition and impairment of 
signal transduction. The treatment-induced per- 
turbations provoke feedback reactions aiming to 
compensate for the stimulus, which almost always 
are associated with signals to the nucleus, resulting
in altered gene expression. Such gene expression
regulations account for both the
pharmacological action and the toxicity of a drug
and can be visualized by either global mRNA or 
global protein expression profiling. Hence, for 
each individual drug, a characteristic gene regula- 
tion pattern, its molecular fingerprint, exists 
which bears valuable information on its mode of 
action and its mechanism of toxicity. 

Gene expression is a multistep process that 
results in an active protein (Fig. 1). There exist
numerous regulation systems that exert control at 
and after the transcription and the translation 
step. Genomics, by definition, encompasses the 
quantitative analysis of transcripts at the mRNA 
level, while the aim of proteomics is to quantify 
gene expression further down-stream, creating a 
snapshot of gene regulation closer to ultimate cell 
function control. 
2. Global mRNA profiling

Expression data at the mRNA level can be
produced using a set of different technologies
such as DNA microarrays, reverse transcript
imaging, amplified fragment length polymorphism
(AFLP), serial analysis of gene expression
(SAGE) and others. Currently, DNA microarrays
are very popular and promise a great potential.
On a typical array, each gene of interest is represented
either by a long DNA fragment (200-2400
bp) typically generated by polymerase chain reaction
(PCR) and spotted on a suitable substrate
using robotics (Schena et al., 1995; Shalon et al.,
1996) or by several short oligonucleotides (20-30
bp) synthesized directly onto a solid support using
photolabile nucleotide chemistry (Fodor et al.,
1991; Chee et al., 1996). From control and treated
tissues, total RNA or mRNA is isolated and
reverse transcribed in the presence of radioactive
or fluorescent labeled nucleotides, and the labeled
probes are then hybridized to the arrays. The
intensity of the array signal is measured for each
gene transcript by either autoradiography or laser
scanning confocal microscopy. The ratio between
the signals of control and treated samples reflects
the relative drug-induced change in transcript
abundance.
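
As a rough illustration of the ratio described above, the following
sketch computes a treated-to-control ratio for each gene and labels it
as up, down, or unchanged. The two-fold cutoff, the function name, and
the example intensities are assumptions made for illustration; the
article does not specify a threshold.

```python
def expression_changes(control, treated, fold=2.0):
    """Return {gene: (ratio, call)} for genes measured in both samples.

    control, treated: dicts mapping a gene name to a signal intensity.
    fold: assumed cutoff; ratios >= fold are "up", <= 1/fold are "down".
    """
    changes = {}
    for gene, control_signal in control.items():
        treated_signal = treated.get(gene)
        # Skip genes missing from one sample or with non-positive signal.
        if treated_signal is None or control_signal <= 0 or treated_signal <= 0:
            continue
        ratio = treated_signal / control_signal
        if ratio >= fold:
            call = "up"
        elif ratio <= 1.0 / fold:
            call = "down"
        else:
            call = "unchanged"
        changes[gene] = (ratio, call)
    return changes


# Hypothetical intensities, for illustration only.
control = {"gene_a": 200.0, "gene_b": 1000.0, "gene_c": 50.0, "gene_d": 0.0}
treated = {"gene_a": 800.0, "gene_b": 950.0, "gene_c": 10.0, "gene_d": 30.0}
changes = expression_changes(control, treated)
```

Genes with a zero or missing signal are skipped here rather than
assigned an infinite ratio; other conventions are equally possible.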



3. Global protein profiling 

Global quantitative expression analysis at the
protein level is currently restricted to the use of
two-dimensional gel electrophoresis. This technique
combines separation of tissue proteins by
isoelectric focusing in the first dimension and by
sodium dodecyl sulfate slab gel electrophoresis-based
molecular weight separation in the second,
orthogonal dimension (Anderson et al., 199?).
The product is a rectangular pattern of protein
spots that are typically revealed by Coomassie
Blue, silver or fluorescent staining (Fig. 2).
Protein spots are identified by mass spectrometry
following generation of peptide mass fingerprints
(Mann et al., 1993) and sequence tags (Wilkins et
al., 1996). Similar to the mRNA approach, the
optical densities of spots from
control and treated samples are compared to
search for treatment-related changes.

4. Expression data analysis 

Bioinformatics forms a key element required to 
organize, analyze and store expression data from 
either source, the mRNA or the protein level. The 
overall objective, once a mass of high-quality 
quantitative expression data has been collected, is
to visualize complex patterns of gene expression
changes, to detect pathways and sets of genes
tightly correlated with treatment efficacy and toxicity
and to compare the effects of different sets of
treatment (Anderson et al., 1996). As the drug
effect database is growing, one may detect similarities
and differences between the molecular fingerprints
produced by various drugs, information
that may be crucial to make a decision whether to
refocus or extend the therapeutic spectrum of a
drug candidate.
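
One simple way to compare molecular fingerprints of the kind
discussed above is to correlate their expression-change profiles. The
sketch below uses the Pearson correlation coefficient over log-ratio
vectors; the choice of metric, the function names, and the example
profiles are assumptions for illustration and are not taken from the
article.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return covariance / (spread_x * spread_y)


def most_similar(query, database):
    """Return the name of the stored fingerprint best correlated with query."""
    return max(database, key=lambda name: pearson(query, database[name]))


# Hypothetical log-ratio fingerprints over the same four genes.
database = {
    "drug_a": [1.0, -1.0, 0.5, 0.0],
    "drug_b": [-1.0, 1.0, -0.5, 0.0],
}
query = [2.0, -2.0, 1.0, 0.0]
best_match = most_similar(query, database)
```

Correlation ignores the overall magnitude of the response, so a
compound producing the same pattern at twice the amplitude still
scores as identical; whether that is desirable depends on the question
being asked.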



5. Comparison of global mRNA and protein 
expression profiling 

There are several synergies and overlaps of data
obtained by mRNA and protein expression analysis.
Low abundant transcripts may not be easily
quantified at the protein level using standard
two-dimensional gel electrophoresis analysis and their
detection may require prefractionation of samples.
The expression of such genes may be preferably
quantified at the mRNA level using
techniques allowing PCR-mediated target amplification.
Tissue biopsy samples typically yield good
quality of both mRNA and proteins; however, the
quality of mRNA isolated from body fluids is
often poor due to the faster degradation of
mRNA when compared with proteins. RNA samples
from body fluids such as serum or urine are
often not very "meaningful", and secreted proteins
are likely more reliable surrogate markers for
treatment efficacy and safety. Detection of
post-translational modifications, events often related to
function or nonfunction of a protein, is restricted
to protein expression analysis and rarely can be
predicted by mRNA profiling. Information on
subcellular localization and translocation of
proteins has to be acquired at the level of the
protein in combination with sample prefractionation
procedures. The growing evidence of a poor
correlation between mRNA and protein abundance
(Anderson and Seilhamer, 1997) further
suggests that the two approaches, mRNA and
protein profiling, are complementary and should
be applied in parallel.



6. Expression profiling and drug development 

Understanding the mechanisms of action and
toxicity, and being able to monitor treatment
efficacy and safety during trials is crucial for the
successful development of a drug. Mechanistic
insights are essential for the interpretation of drug
effects and enhance the chances of recognizing
potential species specificities contributing to an
improved risk profile in humans (Richardson et
al., 1995; Steiner et al., 1996b; Aicher et al., 1998).
The value of expression profiling further increases
when links between treatment-induced expression
profiles and specific pharmacological and toxic
endpoints are established (Anderson et al., 1991,
1995, 1996; Steiner et al., 1996a). Changes in gene
expression are known to precede the manifestation
of morphological alterations, giving expression
profiling a great potential for early
compound screening, enabling one to select drug
candidates with wide therapeutic windows
reflected by molecular fingerprints indicative of
high pharmacological potency and low toxicity
(Arce et al., 1998). In later phases of drug devel-
7. Perspectives

The basic methodology of safety evaluation has
changed little during the past decades. Toxicity in
laboratory animals has been evaluated primarily
by using hematological, clinical chemistry and
histological parameters as indicators of organ
damage. The rapid progress in genomics and
proteomics technologies creates a unique opportunity
to dramatically improve the predictive power of
safety assessment and to accelerate the drug development
process. Application of gene and protein
expression profiling promises to improve lead selection,
resulting in the development of drug candidates
with higher efficacy and lower toxicity.
The identification of biologically relevant surrogate
markers correlated with treatment efficacy
and safety bears a great potential to optimize the
monitoring of pre-clinical and clinical trials.
Application of DNA Arrays to Toxicology



John C. Rockett and David J. Dix
Agency, Research Triangle Park, North Carolina, USA



DNA array technology makes it possible to rapidly genotype individuals or quantify the expression
of thousands of genes on a single filter or glass slide, and holds enormous potential in toxicologic
applications. This potential led to a U.S. Environmental Protection Agency-sponsored workshop
titled "Application of Microarrays to Toxicology" on 7-8 January 1999 in Research Triangle Park,
North Carolina. In addition to providing state-of-the-art information on the application of DNA or
gene microarrays, the workshop catalyzed the formation of several collaborations, committees, and
user's groups throughout the Research Triangle Park area and beyond. Potential applications of
microarrays to toxicologic research and risk assessment include genome-wide expression analyses to
identify gene-expression networks and toxicant-specific signatures that can be used to define mode
of action, for exposure assessment, and for environmental monitoring. Arrays may also prove useful
for monitoring genetic variability and its relationship to toxicant susceptibility in human populations.
Key words: DNA arrays, gene arrays, microarrays, toxicology. Environ Health Perspect
107:681-685 (1999). [Online 6 July 1999]
hnp://ehpn*tl.nuhi.Mh.xov/Jpet/J999/107p681&5r^ 



Decoding the genetic blueprint is a dream that
offers manifold returns in terms of understanding
how organisms develop and function in an
often hostile environment. With the rapid
advances in molecular biology over the last 30
years, the dream has come a step closer to reality.
Molecular biologists now have the ability to
elucidate the composition of any genome.
Indeed, almost 20 genomes have already been
sequenced and more than 60 are currently
under way. Foremost among these is the
Human Genome Mapping Project. However,
the genomes of a number of commonly used
laboratory species are also under intensive
investigation, including yeast, Arabidopsis,
maize, rice, zebra fish, mouse, rat, and dog. It
is widely expected that the completion of such
programs will facilitate the development of
many powerful new techniques and approaches
to diagnosing and treating genetically and
environmentally induced diseases which afflict
mankind. However, the vast amount of data
being generated by genome mapping will
require new high-throughput technologies to
investigate the function of the millions of new
genes that are being reported. Among the most
widely heralded of the new functional
genomics technologies are DNA arrays, which
represent perhaps the most anticipated new
molecular biology technique since polymerase
chain reaction (PCR).

Arrays enable the study of literally thousands
of genes in a single experiment. The
potential importance of arrays is enormous and
has been highlighted by the recent publication
of an entire Nature Genetics supplement dedicated
to the technology (1). Despite this huge
surge of interest, DNA arrays are still little used
and largely unproven, as demonstrated by the
high ratio of review and press articles to actual
data papers. Even so, the potential they offer
has driven venture capitalists into a frenzy of
investment and many new companies are
springing up to claim a share of this rapidly
developing market.

The U.S. Environmental Protection
Agency (EPA) is interested in applying DNA
array technology to ongoing toxicologic studies.
To learn more about the current state of
the technology, the Reproductive Toxicology
Division (RTD) of the National Health and
Environmental Effects Research Laboratory
(NHEERL; Research Triangle Park, NC)
hosted a workshop on "Application of
Microarrays to Toxicology" on 7-8 January
1999 in Research Triangle Park, North
Carolina. The workshop was organized by
David Dix, Robert Kavlock, and John Rockett
of the RTD/NHEERL. Twenty-two intramural
and extramural scientists from government,
academia, and industry shared information,
data, and opinions on the current and
future applications for this exciting new technology.
The workshop had more than 130
attendees, including researchers, students, and
administrators from the EPA, the National
Institute of Environmental Health Sciences
(NIEHS), and a number of other establishments
from Research Triangle Park and
beyond. Presentations ranged from the technology
behind array production through the
sharing of actual experimental data and projections
on the future importance and applications
of arrays. The information contained in
the workshop presentations should provide aid
and insight into arrays in general and their
application to toxicology in particular.

Array Elements

In the context of molecular biology, the word
"array" is normally used to refer to a series of
DNA or protein elements firmly attached in
a regular pattern to some kind of supportive
medium. DNA array is often used interchangeably
with gene array or microarray.
Although not formally defined, microarray is
generally used to describe the higher density
arrays typically printed on glass chips. The
DNA elements that make up DNA arrays
can be oligonucleotides, partial gene
sequences, or full-length cDNAs. Companies
offering pre-made arrays that contain less
than full-length clones normally use regions
of the genes which are specific to that gene to
prevent false positives arising through
cross-hybridization. Sequence verification of
cDNA clone identity is necessary because of
errors in identifying specific clones from
cDNA libraries and databases. Premade
DNA arrays printed on membranes are currently
or imminently available for human,
mouse, and rat. In most cases they contain
DNA sequences representing several thousand
different sequence clusters or genes as
delineated through the National Center for
Biotechnology Information UniGene Project
(2). Many of these different UniGene clusters
(putative genes) are represented only by
expressed sequence tags (ESTs).

Array Printing 

Arrays are typically printed on one of two
types of support matrix. Nylon membranes
are used by most off-the-shelf array providers
such as Clontech Laboratories, Inc.
(Palo Alto, CA), Genome Systems, Inc. (St.
Louis, MO), and Research Genetics, Inc.
(Huntsville, AL). Microarrays such as those
produced by Affymetrix, Inc. (Santa Clara,
CA), Incyte Pharmaceuticals, Inc. (Palo Alto,
CA), and many do-it-yourself (DIY) arraying
groups use glass wafers or slides. Although
standard microscope slides may be used, they
must be pretreated to facilitate sticking
of the DNA to the glass. Several different
coatings have been successfully used, including
silane and lysine. The coating of slides
can easily be carried out in the laboratory,
but many prefer the convenience of precoated
slides available from suppliers.

Once the support matrix has been prepared,
the DNA elements can be applied by
several methods. Affymetrix, Inc. has developed
a unique photolithographic technology
for attaching oligonucleotides to glass wafers.
More commonly, DNA is applied by either
noncontact or contact printing. Noncontact
printers can use thermal, solenoid, or piezoelectric
technology to spray aliquots of solution
onto the support matrix and may be used to
produce slide or membrane-based arrays.
Cartesian Technologies, Inc. (Irvine, CA) has
developed nQUAD technology for use in its
PixSys printers. The system couples a syringe
pump with the microsolenoid valve, a combination
that provides rapid quantitative dispensing
of nanoliter volumes (down to [illegible] nL) over
a variable volume range. A different approach
to noncontact printing uses a solid pin and ring
combination (Genetic MicroSystems, Inc.,
Woburn, MA). This system (Figure 1) allows a
broader range of sample, including cell suspensions
and particulates, because the printing
head cannot be blocked up in the same way as
a spray nozzle. Fluid transfer is controlled in
this system primarily by the pin dimensions
and the force of deposition, although the
nature of the support matrix and the sample
will also affect transfer to some degree.

In contact printing, the pin head is dipped
in the sample and then touched to the support
matrix to deposit a small aliquot. Split pins
were one of the first contact-printing devices
to be reported and are the suggested format
for DIY arrayers, as described by Brown (3).
Split pins are small metal pins with a precise
groove cut vertically in the middle of the pin
tip. In this system, 1-48 split pins are positioned
in the pin head. The split pins work by
simple capillary action, not unlike a fountain
pen: when the pin heads are dipped in the
sample, liquid is drawn into the pin groove. A
small (fixed) volume is then deposited each
time the split pins are gently touched to
the support matrix. Sample (100-500 pL
depending on a variety of parameters) can be
deposited on multiple slides before refilling is
required, and array densities of > 2,500
spots/cm² may be produced. The deposit volume
depends on the split size, sample fluidity,
and the speed of printing. Split pins are
relatively simple to produce and can be made
in-house if a suitable machine shop is available.
Alternatively, they can be obtained
directly from companies such as TeleChem
International, Inc. (Sunnyvale, CA).
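
To put the quoted array density in concrete terms, the short sketch
below converts spots per square centimeter into the center-to-center
spacing of the spots. It assumes the spots are laid out on a square
grid, which the article does not state; the function name is
hypothetical.

```python
import math


def spot_pitch_um(spots_per_cm2):
    """Center-to-center spot spacing in micrometers for a square grid."""
    if spots_per_cm2 <= 0:
        raise ValueError("density must be positive")
    spots_per_cm = math.sqrt(spots_per_cm2)  # spots along one 1 cm edge
    return 10_000.0 / spots_per_cm           # 1 cm = 10,000 micrometers


pitch = spot_pitch_um(2500)
```

Under that assumption, a density of 2,500 spots/cm² corresponds to 50
spots along each centimeter, or a spacing of 200 micrometers.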

Irrespective of their source, printers
should be run through a preprint sequence
prior to producing the actual experimental
arrays: the first 100 or so spots of a new run
tend to be somewhat variable. Factors affecting
spot reproducibility include slide treatment
homogeneity, sample differences, and
instrument errors. Other factors that come
into play include clean ejection of the drop
and clogging (nQUAD printing) and
mechanical variations and long-term alteration
in print-head surface of solid and split
pins. However, with careful preparation it is
possible to get a coefficient of variance for
spot reproducibility below 10%.
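
The reproducibility figure quoted above is a coefficient of variation:
the standard deviation of replicate spot signals expressed as a
percentage of their mean. A minimal sketch follows, using the sample
standard deviation; the function name and the replicate intensities
are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    average = mean(values)
    if average == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return 100.0 * stdev(values) / average


# Hypothetical intensities of five replicate spots of the same sample.
replicate_spots = [95.0, 100.0, 105.0, 100.0, 100.0]
cv_percent = coefficient_of_variation(replicate_spots)
meets_target = cv_percent < 10.0
```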

One potential printing problem is sample 
carryover. Repeated washing, blotting, and 
drying (vacuum) of print pins between samples 
is normally effective at reducing sample carryover
to negligible amounts. Printing should
also be carried out in a controlled environ- 
ment. Humidified chambers are available in 
which to place printers. These help prevent 
dust contamination and produce a uniform 
drying rate, which is important in determining 
spot size, quality, and reproducibility. 

In summary, although several printing 
technologies are available, none are par- 
ticularly outstanding and the bottom line 
is that they are still in a relatively early stage
of evolution. 

Array Hybridization 

The hybridization protocol is, practically
speaking, relatively straightforward and those
with previous experience in blotting should
have little difficulty. Array hybridizations
are, in essence, reverse Southern/Northern
blots: instead of applying a labeled probe to
the target population of DNA/RNA, the
labeled population is applied to the probe(s).
With membrane-based arrays, the control and
treated mRNA populations are normally converted
to cDNA and labeled with isotope (e.g.,
[illegible]P) in the process. These labeled populations
are then hybridized independently to parallel
or serial arrays and the hybridization signal is
detected with a phosphorimager. A less commonly
used alternative to radioactive probes is
enzymatic detection. The probe may be
biotinylated, haptenylated, or have alkaline
phosphatase/horseradish peroxidase attached.
Hybridization is detected by enzymatic reaction
yielding a color reaction (4). Differences
in hybridization signals can be detected by eye
or, more accurately, with the help of digital
imaging and commercially available software.
The labeling of the test populations for
slide-based microarrays uses a slightly different
approach. The probe typically consists of two
samples of polyA+ RNA (usually from a treated
and a control population) that are converted to
cDNA; in the process each is labeled with a
different fluor. The independently labeled
probes are then mixed together and hybridized
to a single microarray slide and the resulting
combined fluorescent signal is scanned. After




Figure 1. Genetic MicroSystems (Woburn, MA) pin
and ring system for printing arrays. The pin and ring
combination consists of a circular open ring oriented
parallel to the sample solution, with a vertical pin
centered over the ring. When the ring is dipped
into a solution and lifted, it withdraws an aliquot
of sample held by surface tension. To spot the
sample, the pin is driven down through the ring
and a portion of the solution is transferred to the
bottom of the pin. The pin continues to move
downward until the pendant drop of solution
makes contact with the underlying surface. The
pin is then lifted, and gravity and surface tension
cause deposition of the spot onto the array.
Figure from Flowers et al. (14), with permission
from Genetic MicroSystems.

normalization, it is possible to determine the
ratio of fluorescent signals from a single
hybridization of a slide-based microarray.
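
The article does not say how the two fluorescent channels are
normalized. As one illustrative possibility, the sketch below applies
a global median normalization, which assumes that most genes on the
array are unchanged so that the median treated-to-control ratio should
equal 1. The function name and intensities are hypothetical.

```python
from statistics import median


def normalized_ratios(control_signal, treated_signal):
    """Treated/control ratios rescaled so that the median ratio is 1.

    Both arguments are equal-length lists of positive spot intensities,
    one entry per spot, from the two fluorescent channels of one slide.
    """
    raw = [t / c for c, t in zip(control_signal, treated_signal)]
    scale = median(raw)  # assumed to reflect a channel-wide labeling bias
    return [ratio / scale for ratio in raw]


# Hypothetical spot intensities: the treated channel reads 1.5x brighter
# overall, and the fourth spot is additionally induced four-fold.
control_channel = [100.0, 200.0, 400.0, 800.0, 1000.0]
treated_channel = [150.0, 300.0, 600.0, 4800.0, 1500.0]
ratios = normalized_ratios(control_channel, treated_channel)
```

If a large fraction of the arrayed genes respond to treatment, the
median is no longer a safe estimate of the channel bias, and a set of
control spots would be a better basis for the scale factor.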

cDNA derived from control and treated
populations of RNA is most commonly
hybridized to arrays, although subtractive
hybridization or differential display reactions
may also be used. Fluorophore- or radiolabeled
nucleotides are directly incorporated
into the cDNA in the process of converting
RNA to cDNA. Alternatively, 5' end-labeled
primers may be used for cDNA synthesis.
These are labeled with a fluorophore for
direct visualization of the hybridized array.
Alternatively, biotin or a hapten may be
attached to the primer, in which case
fluor-labeled streptavidin or antibody must be
applied before a signal can be generated. The
most commonly used fluorophores at present
are cyanine (Cy)3 and Cy5 (Amersham
Pharmacia Biotech AB, Uppsala, Sweden).
However, the relative expense of these fluorescent
conjugates has driven a search for
cheaper alternatives. Fluorescein, rhodamine,
and Texas red have all been used, and
companies such as Molecular Probes, Inc.
(Eugene, OR) are developing a series of
labeled nucleotides with a wide range of excitation
and emission spectra which may prove
to function as well as the Cy dyes.
Analysis of DNA Microarrays

Membrane-based arrays are normally analyzed
on film or with a phosphorimager, whereas
chip-based arrays require more specialized scanning
devices. These can be divided into three
main groups: the charge-coupled device camera
systems, the nonconfocal laser scanners, and the
confocal laser scanners. The advantages and
disadvantages of each system are listed in Table 1.

Because a typical spot on a microarray can
contain > 10^6 molecules, it is clear that a large
variation in signal strength may occur.
Current scanners cannot work across this
many orders of magnitude (4 or 5 is more typical).
However, the scanning parameters can
normally be adjusted to collect more or less
signal, such that two or three scans of the same
array should permit the detection of rare and
abundant genes.
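
How several scans of one array are combined is not described in the
article. The sketch below shows one plausible approach, stated as an
assumption: keep the high-sensitivity reading for each spot unless it
is saturated, and otherwise substitute the low-sensitivity reading
rescaled by the typical ratio between the two scans. The 16-bit
saturation level, the function name, and the intensities are all
hypothetical.

```python
from statistics import median

SATURATION = 65535  # assumed ceiling of a 16-bit detector


def merge_scans(low_gain, high_gain, saturation=SATURATION):
    """Combine two scans of the same spots into one extended-range list.

    low_gain, high_gain: equal-length lists of spot intensities from a
    low-sensitivity and a high-sensitivity scan of the same array.
    """
    # Estimate the gain ratio from spots that are usable in both scans.
    # statistics.median raises StatisticsError if no such spot exists.
    scale = median(
        high / low
        for low, high in zip(low_gain, high_gain)
        if high < saturation and low > 0
    )
    return [
        high if high < saturation else low * scale
        for low, high in zip(low_gain, high_gain)
    ]


# Hypothetical readings: the last spot saturates the high-gain scan.
low_scan = [10.0, 100.0, 1000.0, 20000.0]
high_scan = [100.0, 1000.0, 10000.0, 65535.0]
merged = merge_scans(low_scan, high_scan)
```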

When a microarray is scanned, the fluorescent
images are captured by software normally
included with the scanner. Several commercial
suppliers provide additional software for
quantifying array images, but the software tools are
constantly evolving to meet the developing
needs of researchers, and it is prudent to
define one's own needs and clarify the exact
capabilities of the software before its purchase.
Issues that should be considered include the
following:

• Can the software locate offset spots?

• Can it quantitate across irregular hybridization
signals?

• Can the arrayed genes be programmed in for
easy identification and location?

• Can the software connect via the Internet to
databases containing further information on
the gene(s) of interest?

One of the key issues raised at the workshop
was the sensitivity of microarray technology.
Experiments by General Scanning, Inc.
(Watertown, MA), have shown that by using
the Cy dyes and their scanner, signal can be
detected down to levels of < 1 fluor molecule
per square micrometer, which translates to
detecting a rare message at approximately one
copy per cell or less.

Array Applications 

Although arrays are an emerging technology
certain to undergo improvement and
alteration, they have already been applied usefully
to a number of model systems. Arrays are
at their most powerful when they contain the
entire genome of the species they are being
used to study. For this reason, they have strong
support among researchers utilizing yeast and
Caenorhabditis elegans (5). The genomes of
both of these species have been sequenced and
in the case of yeast, deposited onto arrays for
examination of gene expression (6,7). With
both of these species, it is relatively easy to
perturb individual gene expression. Indeed, C.
elegans knockouts can be made simply by
soaking the worms in an antisense solution of
the gene to be knocked out.

Table 1. Advantages and disadvantages of different microarray scanning systems

By a process of systematic gene disruption,
it is now possible to examine the cause
and effect relationships between different
genes in these simple organisms. This kind of
approach should help elucidate biochemical
pathways and genetic control processes,
deconvolute polygenic interactions, and
define the architecture of the cellular network.
A simple case study of how this can be
achieved was presented by Butow [University
of Texas Southwestern Medical Center,
Dallas, TX (Figure 2)]. Although it is the
phenotypic result of a single gene knockout
that is being examined, the effect of such
perturbation will almost always be polygenic.
Polygenic interactions will become increasingly
important as researchers begin to move
away from single gene systems when examining
the nature of toxicologic responses to
external stimuli. This is especially important
in toxicology because the phenotype produced
by a given environmental insult is
never the result of the action of a single gene;
rather, it is a complex interaction of one or
multiple cellular pathways. Phenomena such
as quantitative traits (the continuous variation
of phenotype), epistasis (the effect of alleles of
one or more genes on the expression of other
genes), and penetrance (proportion of individuals
of a given genotype that display a particular
phenotype) will become increasingly
evident and important as toxicologists push
toward the ultimate goal of matching the
responses of individuals to different
environmental stimuli.

Analysis of the transcriptome (the expression
level of all the genes in a given cell population)
was a use of arrays addressed by several
speakers. Unfortunately, current gene nomenclature
is often confusing in that single genes
are allocated multiple names (usually as a result
of independent discovery by different laboratories),
and there was a call for standardization of
gene nomenclature. Nevertheless, once a transcriptome
has been assembled it can then be
transferred onto arrays and used to screen any
chosen system. The EPA MicroArray
Consortium (EPAMAC) is assembling testes
transcriptomes for human, rat, and mouse. In a
slightly different approach, Nuwaysir et al. (8)
describe how the NIEHS assembled what is
effectively a "toxicological transcriptome": a
library of human and mouse genes that have
previously been proven or implicated in
responses to toxicologic insults. Clontech
Laboratories, Inc. (Palo Alto, CA), has begun a
similar process by developing stress/toxicology
filter arrays of rat, mouse, and human genes.
Thus, rather than being tissue or cell specific,
these stress/toxicology arrays can be used across
a variety of model systems to look for alterations
in the expression of toxicologically
important genes and define the new field of
toxicogenomics. The potential to identify toxicant
families based on tissue- or cell-specific
gene expression could revolutionize drug testing.
These molecular signatures or fingerprints
could not only point to the possible
toxicity/carcinogenicity of newly discovered
compounds (Figure 5), but also aid in elucidating
their mechanism of action through identification
of gene expression networks. By extension,
such signatures could provide easily identifiable
biomarkers to assess the degree, time,
and nature of exposure.

DNA arrays are primarily a tool for examining
differential gene expression in a given
model. In this context they are referred to as
closed systems because they lack the ability of
other differential expression technologies, e.g.,
differential display and subtractive hybridization,
to detect previously unknown genes not
present on the array. This would appear to
limit the power of DNA arrays to the imaginations
and preconceptions of the researcher in
selecting genes previously characterized and
thought to be involved in the model system.
However, the various genome sequencing projects
have created a new category of
sequence, the EST, that has partially mollified
this deficiency. ESTs are cDNAs expressed
in a given tissue that, although they may share
some degree of sequence similarity to previously
characterized genes, have not been assigned
specific genetic identity. By incorporating EST
clones into an array, it is possible to monitor
the expression of these unknown genes. This
can enable the identification of previously
uncharacterized genes that may have biologic
significance in the model system. Filter arrays
from Research Genetics and slide arrays from
Incyte Pharmaceuticals both incorporate large
numbers of ESTs from a variety of species.

A further use of microarrays is the identification
of single nucleotide polymorphisms
(SNPs). These genomic variations are abundant
(they occur approximately every 1 kb or
so) and are the basis of restriction fragment
length polymorphism analysis used in forensic
analysis. Affymetrix, Inc. designed chips that
contain multiple repeats of the same gene
sequence. Each position is present with all four
possible bases. After the hybridization of the
sample, the degree of hybridization to the different
sequences can be measured and the exact
sequence of the target gene deduced. SNPs are
thought to be of vital importance in drug
metabolism and toxicology. For example, single
base differences in the regulatory region or
active site of some genes can account for huge
differences in the activity of that gene. Such
SNPs are thought to explain why some people
are able to metabolize certain xenobiotics better
than others. Thus, arrays provide a further
tool for the toxicologist investigating the
nature of susceptible subpopulations and toxicologic
response.
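
The resequencing idea described above (every position probed with all four possible bases, with the strongest hybridization signal indicating the base present) can be illustrated with a minimal Python sketch. The function name `call_sequence`, the `min_ratio` cutoff, and the intensity values are hypothetical choices made for illustration; they are not taken from the article or from any vendor's software.

```python
from typing import Dict, List

BASES = "ACGT"


def call_sequence(intensities: List[Dict[str, float]], min_ratio: float = 1.5) -> str:
    """Call one base per position from four probe intensities.

    intensities[i] maps each base to the hybridization signal of the probe
    carrying that base at position i. A position is reported as "N" when the
    strongest signal does not exceed the runner-up by min_ratio
    (min_ratio is an illustrative assumption, not a published cutoff).
    """
    called = []
    for probes in intensities:
        # Rank the four candidate bases by signal, strongest first.
        ranked = sorted(BASES, key=lambda base: probes[base], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        if probes[best] <= 0 or probes[best] < min_ratio * probes[runner_up]:
            called.append("N")  # ambiguous or no signal
        else:
            called.append(best)
    return "".join(called)


# Made-up intensities for three positions; the third is ambiguous (A vs. T).
example = [
    {"A": 950.0, "C": 40.0, "G": 55.0, "T": 60.0},
    {"A": 30.0, "C": 35.0, "G": 880.0, "T": 42.0},
    {"A": 410.0, "C": 25.0, "G": 30.0, "T": 395.0},
]
print(call_sequence(example))
```

An ambiguous position such as the third one could reflect a heterozygous site or simply poor hybridization; the sketch does not attempt to distinguish the two.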

There are still many wrinkles to be ironed
out before arrays become a standard tool for
toxicologists. The main issues raised at the
workshop by those with hands-on experience
were the following:

• Expense: the cost of purchasing/contracting
this technology is still too great for many
individual laboratories.
Figure 2. Potential effects of gene knockout within
positively and negatively regulated gene expression
networks. [illegible] is limiting in wild type for expression of
[illegible]. (A) A simple, two-component linear regulatory
network operating on gene [illegible], where [illegible] is a positive
effector of [illegible] and [illegible] is either a positive or negative
effector of [illegible]. This network could be deduced by
examining the consequence of (B) deleting [illegible] on the
expression of [illegible] and [illegible], where the expression of [illegible]
would be decreased or increased depending on
whether [illegible] was a positive or negative regulator.
These and other connected components of even
greater complexity could be revealed by genome-wide
expression analysis. From Butow (15).



• Clones: the logistics of identifying, obtaining,
and maintaining a set of nonredundant,
noncontaminated, sequence-verified, species/cell/
tissue/field-specific clones.

• Use of inbred strains: where whole-organism
models are being used, the use of inbred
strains is important to reduce the potentially
confusing effects of the individual variation
typically seen in outbred populations.

• Probe: the need for relatively large amounts
of RNA, which limits the type of sample
(e.g., biopsy) that can be used. Also, different
RNA extraction methods can give different
results.

• Specificity: the ability to discriminate accurately
between closely related genes (e.g., the
cytochrome P450 family) and splice variants.

• Quantitation: the quantitation of gene
expression using gene arrays is still open to
debate. One reason for this is the different
incorporation of the labeling dyes. However,
the main difficulty lies in knowing what to
normalize against. One option is to include a
large number of so-called housekeeping genes
in the array. However, the expression of these
genes often changes depending on the tissue
and the toxicant, so it is necessary to characterize
the expression of these genes in the
model system before utilizing them. This is
clearly not a viable option when screening
multiple new compounds. A second option
is to include on the array genes from a nonrelated
species (e.g., a plant gene on an animal
array) and to spike the probe with synthetic
RNA(s) complementary to the gene(s).

• Reproducibility: this is sometimes questionable,
and a figure of approximately two or
three repeats was used as the minimum number
required to confirm initial findings.
Again, however, most [illegible]
use of Northern blots or [illegible]
PCR to confirm findings.

• Sensitivity: concerns were voiced about the
number of target molecules that must be present
in a sample for them to be detected on
the array.

• Efficiency: reproducible identification of 1.5-
to [illegible]-fold differences in expression was reported,
although the number of genes that
undergo this level of change and remain
undetected is open to debate. It is important
that this level of detection be ultimately
achieved because it is commonly perceived
that some important transcription factors
and their regulators respond at such low levels.
In most cases, [illegible]- to [illegible]-fold was the minimum
change that most were happy to
accept.

• Bioinformatics: perhaps the greatest concern
was how to accurately interpret the data with
the greatest accuracy and efficiency. The
biggest headache is trying to identify networks
of gene expression that are common to
different treatments or doses. The amount of
data from a single experiment is huge. It may
be that, in the future, several groups individually
equipped with specialized software algorithms
for studying their favorite genes or
gene systems will be able to share the same
hybridized chips. Thus, arrays could usher in
a new perspective on collaboration and the
sharing of data.
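
The quantitation issue raised in the list above (normalizing against spiked-in control RNA from a nonrelated species, then asking which genes change by some minimum fold) can be made concrete with a short Python sketch. Everything named here is an illustrative assumption: the function names, the use of a geometric-mean scaling factor, the default `min_fold` of 2.0, and the intensity values are not taken from the workshop report.

```python
import math
from typing import Dict, List


def spike_in_scale(control: Dict[str, float], treated: Dict[str, float],
                   spikes: List[str]) -> float:
    """Factor that rescales the treated channel so the spiked controls,
    which were added in equal amounts, agree on average (geometric mean)."""
    log_ratios = [math.log(control[name] / treated[name]) for name in spikes]
    return math.exp(sum(log_ratios) / len(log_ratios))


def changed_genes(control: Dict[str, float], treated: Dict[str, float],
                  spikes: List[str], min_fold: float = 2.0) -> Dict[str, float]:
    """Return genes whose normalized treated/control ratio is at least
    min_fold up or down. min_fold is a hypothetical threshold."""
    scale = spike_in_scale(control, treated, spikes)
    hits = {}
    for gene in control:
        if gene in spikes:
            continue  # controls are used for scaling only
        fold = (treated[gene] * scale) / control[gene]
        if fold >= min_fold or fold <= 1.0 / min_fold:
            hits[gene] = fold
    return hits


# Made-up signal intensities; the treated channel is uniformly half as bright.
control = {"spike1": 1000.0, "spike2": 500.0, "geneA": 200.0, "geneB": 300.0, "geneC": 400.0}
treated = {"spike1": 500.0, "spike2": 250.0, "geneA": 300.0, "geneB": 160.0, "geneC": 50.0}
hits = changed_genes(control, treated, ["spike1", "spike2"])
print(sorted(hits))
```

Without the spike-in correction, geneA would appear to change only 1.5-fold and would be missed at a 2-fold cutoff, which is the kind of artifact the normalization discussion is concerned with.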

EPAMAC 

Perhaps the main reason most scientists are
unable to use array technology is the high cost
involved, whether buying off-the-shelf membranes,
using contract printing services, or
producing chips in-house. In view of this,
researchers at the RTD/NHEERL initiated
the EPAMAC. This consortium brings
together scientists from the EPA and a number
of extramural labs with the aim of developing
microarray capability through the sharing
of resources and data. EPAMAC
researchers are primarily interested in the
developmental and toxicologic changes seen
in testicular and breast tissue, and a portion
of the workshop was set aside for EPAMAC
members to share their ideas on how the
experimental application of microarrays could
facilitate their research. One of the central
areas of interest to EPAMAC members is the
effect of xenobiotics on male fertility and
reproductive health. Of greatest concern is
the effect of exposure during critical periods
of development and germ cell differentiation
(9), and how this may compromise sperm
counts and quality following sexual maturation
(10). As well as spermatogenic tissue,
there is also interest in how residual mRNA
found in mature sperm (11) could be used as
an indicator of previous xenobiotic effects (it
is easier to obtain a semen sample than a testicular
biopsy). Arrays will be used to examine
and compare the effect of exposure to heat
and chemicals in testicular and epididymal
gene expression profiles, with the aim of
establishing relationships/associations
between changes in developmental landmarks
and the effects on sperm count and quality.
Cluster, pattern, and other analysis of such
data should help identify hidden relationships
between genes that may reveal potential
mechanisms of action and uncover roles for
genes with unknown functions.

Summary 

The full impact of DNA arrays may not be
seen for several years, but the interest shown at
this regional workshop indicates the high level
of interest that they foster. Apart from educating
and advertising the various technologies in
this field, this workshop brought together a
number of researchers from the Research
Triangle Park area who are already using DNA
arrays. The interest in sharing ideas and experiences
led to the initiation of a Triangle array
user's group.



Array technology is still in its infancy. This
means that the hardware is still improving and
there is no current consensus for standard procedures,
quantitation, and interpretation.
Consistency in spotting and scanning arrays is
not yet optimized, and this is one of the most
critical requirements of any experiment. In
addition, one of the dark regions of array technology
(strife in the courts over who owns
what portions of it) has further muddied the
future and is a potential barrier toward the
development of consensus procedures.

Perhaps the greatest hurdle for the application
of arrays is the actual interpretation of
data. No specialists in bioinformatics attended
the workshop, largely because they are rare and
because as yet no one seems clear on the best
method of approaching data analysis and interpretation.
Cross-referencing results from multiple
experiments (time, dose, repeats, different
animals, different species) to identify commonly
expressed genes is a great challenge. In most
cases, we are still a long way from understanding
how the expression of gene X is related to
the expression of gene Y, and ordering gene
expression to delineate causal relationships.

To the ordinary scientist in the typical laboratory,
however, the most immediate problem
is a lack of affordable instrumentation.
One can purchase premade membranes at
relatively affordable prices. Although these
may be useful in identifying individual genes
to pursue in more detail using other methods,
the numbers that would be required for even a
small routine toxicology experiment prohibit
this as a truly viable approach. For the toxicologist,
there is a need to carry out multiple
experiments: dose responses, time curves,
multiple animals, and repeats. Glass-based
DNA arrays are most attractive in this context
because they can be prepared in large batches
from the same DNA source and accommodate
control and treated samples on the same
chip. Another problem with current off-the-shelf
arrays is that they often do not contain
one or more of the particular genes a group is
interested in. One alternative is to obtain
and/or produce a set of custom clones and
have contract printing of membranes or slides
carried out by a company such as Genomic
Solutions, Inc. (Ann Arbor, MI). This approach
is less expensive than [illegible]
one's own entire system, although at some
point it might make economic [illegible]
one's own arrays.

Finally, DNA arrays are currently a team
effort. They are a technology that uses a wide
range of skills including engineering, statistics,
molecular biology, chemistry, and bioinformatics.
Because most individuals are skilled in
only one or perhaps two of these areas, it
appears that success with arrays may be best
expected by teams of collaborators consisting
of individuals having each of these skills.

Those considering array applications may
be amused or goaded on by the following
quote from Fortune magazine (12):

Microprocessors have reshaped our economy,
spawned vast fortunes and changed the way we live.
Gene chips could be even bigger.

Although this comment may have been
designed to excite the imagination rather than
accurately reflect the truth, it is fair to say that
the age of functional genomics is upon us.
DNA arrays look set to be an important tool in
this new age of biotechnology and will likely
contribute answers to some of toxicology's
most fundamental questions.

References and Notes

1. The chipping forecast. Nat Genet 21(suppl):1-60 (1999).

2. National Center for Biotechnology Information. The
UniGene System. Available: www.ncbi.nlm.nih.gov/
Schuler/UniGene [cited 22 March 1999].

3. Brown PO. The Brown Lab. Available: http://
cmgm.stanford.edu/pbrown [cited 22 March 1999].

4. Chen JJ, Wu R, Yang PC, Huang JY, Sher YP, Han MH,
Kao WC, Lee PJ, Chiu TF, Chang F, et al. Profiling expression
patterns and isolating differentially expressed genes
by cDNA microarray system with colorimetry detection.
Genomics 51:313-324 (1998).

5. Ward S. DNA Microarray Technology to Identify Genes
Controlling Spermatogenesis. Available: www.mcb.
arizona.edu/wardlab/microarray.html [cited 22 March 1999].

6. Marton MJ, DeRisi JL, Bennett HA, Iyer VR, Meyer MR,
Roberts CJ, Stoughton R, Burchard J, Slade D, Dai H, et
al. Drug target validation and identification of secondary
drug target effects using DNA microarrays. Nat Med
4:1293-1301 (1998).

7. Brown PO. The Full Yeast Genome on a Chip. Available:
http://cmgm.stanford.edu/pbrown/yeastchip.html [cited
22 March 1999].

8. Nuwaysir EF, Bittner M, Trent J, Barrett JC, Afshari CA.
Microarrays and toxicology: the advent of toxicogenomics.
Mol Carcinog 24:153-159 (1999).

9. Hecht NB. Molecular mechanisms of male germ cell differentiation.
Bioessays 20:555-561 (1998).

10. Zacharewski TR. Timothy R. Zacharewski. Available:
[URL illegible] [cited 22 March 1999].

11. Kramer JA, Krawetz SA. RNA in spermatozoa: implications
for the alternative haploid genome. Mol Hum
Reprod 3:473-478 (1997).

12. Stipp D. Gene chip breakthrough. Fortune, March
31:56-73 (1997).

13. Kawasaki E (General Scanning Instruments,
Watertown, MA). Unpublished data.

14. Flowers P, Overbeck J, Mace ML Jr, Pagliughi FM,
Eggers WJ, Yonkers H, Honkanen P, Montagu J, Rose
SD. Development and Performance of a Novel
Microarraying System Based on Surface Tension
Forces. Available: http://www.geneticmicro.com/
resources/html/coldspring.html [cited 22 March 1999].

15. Butow R (University of Texas Medical Center, Dallas, TX).



Docket No.: PF-0221-3 DIV
USSN: 09/991,212
Ref. No. VI

Subject: RE: [Fwd: Toxicology Chip]
Date: Mon, 3 Jul 2000 08:09:45 -0400
From: "Afshari, Cynthia" <afshari@niehs.nih.gov>
To: "Diana Hamlet-Cox" <dianahc@incyte.com>

You can see the list of clones that we have on our 12K chip at
http://manuel.niehs.nih.gov/maps/guest/clonesrch.cfm

We selected a subset of genes (2000K) that we believed critical to tox
response and basic cellular processes and added a set of clones and ESTs to
this. We have included a set of control genes (80+) that were selected by
the NHGRI because they did not change across a large set of array
experiments. However, we have found that some of these genes change
significantly after tox treatments and are in the process of looking at the
variation of each of these 80+ genes across our experiments.
Our chips are constantly changing and being updated and we hope that our
data will lead us to what the toxchip should really be.
I hope this answers your question.
Cindy Afshari

> From: Diana Hamlet-Cox
> Sent: Monday, June 26, 2000 8:52 PM
> To: afshari@niehs.nih.gov
> Subject: [Fwd: Toxicology Chip]
>
> Dear Dr. Afshari,
>
> Since I have not yet had a response from Bill Grigg, perhaps he was not
> the right person to contact.
>
> Can you help me in this matter? I don't need to know the sequences,
> necessarily, but I would like very much to know what types of sequences
> are being used, e.g., GPCRs (more specific?), ion channels, etc.
>
> Diana Hamlet-Cox
>
> Original Message
> Subject: Toxicology Chip
> Date: Mon, 19 Jun 2000 18:31:48 -0700
> From: Diana Hamlet-Cox <dianahc@incyte.com>
> Organization: Incyte Pharmaceuticals
> To: grigg@niehs.nih.gov
>
> Dear Colleague:
>
> I am doing literature research on the use of expressed genes as
> pharmacotoxicology markers, and found the Press Release dated February
> 29, 2000 regarding the work of the NIEHS in this area. I would like to
> know if there is a resource I can access (or you could provide?) that
> would give me a list of the 12,000 genes that are on your Human ToxChip
> Microarray. In particular, I am interested in the criteria used to
> select sequences for the ToxChip, including any control sequences
> included in the microarray.
>
> Thank you for your assistance in this request.
>
> Diana Hamlet-Cox, Ph.D.
> Incyte Genomics, Inc.
>
> --
>



07/31/2000 10:34 AM 



> This email message is for the sole use of the intended recipient(s) and
> may contain confidential and privileged information subject to
> attorney-client privilege. Any unauthorized review, use, disclosure or
> distribution is prohibited. If you are not the intended recipient,
> please contact the sender by reply email and destroy all copies of the
> original message.
> =========================







Docket No.: PF-0221-3
USSN: 09/991,212
Ref. No. VII



Proc. Natl. Acad. Sci. USA

Vol. 95, pp. 6073-6078, May 1998 

Biochemistry 

Assessing sequence comparison methods with reliable structurally 
ABSTRACT Pairwise sequence comparison methods have
been assessed using proteins whose relationships are known
reliably from their structures and functions, as described in
the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T.
& Chothia, C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation
tested the programs BLAST [Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol.
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996)
Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. &
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448],
and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol.
Biol. 147, 195-197] and their scoring schemes. The error rate
of all algorithms is greatly reduced by using statistical scores
to evaluate matches rather than percentage identity or raw
scores. The E-value statistical scores of SSEARCH and FASTA are
reliable: the number of false positives found in our tests agrees
well with the scores reported. However, the P-values reported
by BLAST and WU-BLAST2 exaggerate significance by orders of
magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform
best, and they are capable of detecting almost all relationships
between proteins whose sequence identities are >30%. For
more distantly related proteins, they do much less well; only
one-half of the relationships between proteins with 20-30%
identity are found. Because many homologs have low sequence
similarity, most distant relationships cannot be detected by
any pairwise comparison method; however, those which are
identified may be used with confidence.



Sequence database searching plays a role in virtually every 
branch of molecular biology and is crucial for interpreting the 
sequences issuing forth from genome projects. Given the 
method's central role, it is surprising that overall and relative 
capabilities of different procedures are largely unknown. It is 
difficult to verify algorithms on sample data because this 
requires large data sets of proteins whose evolutionary rela- 
tionships are known unambiguously and independently of the 
methods being evaluated. However, nearly all known ho- 
mologs have been identified by sequence analysis (the method 
to be tested). Also, it is generally very difficult to know, in the 
absence of structural data, whether two proteins that lack clear 
sequence similarity are unrelated. This has meant that al- 
though previous evaluations have helped improve sequence 
comparison, they have suffered from insufficient, imperfectly 
characterized, or artificial test data. Assessment also has been 
problematic because high quality database sequence searching 
attempts to have both sensitivity (detection of homologs) and 
specificity (rejection of unrelated proteins); however, these 
complementary goals are linked such that increasing one 
causes the other to be reduced. 
Sequence comparison methodologies have evolved rapidly,
so no previously published test has evaluated modern versions
of programs commonly used. For example, parameters in
BLAST (1) have changed, and WU-BLAST2 (2), which produces
gapped alignments, has become available. The latest version
of FASTA (3) previously tested was 1.6, but the current release
(version 3.0) provides fundamentally different results in the
form of statistical scoring.

The previous reports also have left gaps in our knowledge. 
For example, there has been no published assessment of 
thresholds for scoring schemes more sophisticated than per- 
centage identity. Thus, the widely discussed statistical scoring 
measures have never actually been evaluated on large data- 
bases of real proteins. Moreover, the different scoring schemes 
commonly in use have not been compared. 

Beyond these issues, there is a more fundamental question: 
in an absolute sense, how well does pairwise sequence com- 
parison work? That is, what fraction of homologous proteins 
can be detected using modern database searching methods? 

In this work, we attempt to answer these questions and to 
overcome both of the fundamental difficulties that have hin- 
dered assessment of sequence comparison methodologies. 
First, we use the set of distant evolutionary relationships in the 
SCOP: Structural Classification of Proteins database (4), which 
is derived from structural and functional characteristics (5). 
The SCOP database provides a uniquely reliable set of homologs,
which are known independently of sequence compar-
ison. Second, we use an assessment method that jointly mea- 
sures both sensitivity and specificity. This method allows 
straightforward comparison of different sequence searching 
procedures. Further, it can be used to aid interpretation of real 
database searches and thus provide optimal and reliable 
results. 

Previous Assessments of Sequence Comparison. Several 
previous studies have examined the relative performance of 
different sequence comparison methods. The most encom- 
passing analyses have been by Pearson (6, 7), who compared 
the three most commonly used programs. Of these, the Smith- 
Waterman algorithm (8) implemented in SSEARCH (3) is the 
oldest and slowest but the most rigorous. Modern heuristics 
have provided BLAST (1) the speed and convenience to make
it the most popular program. Intermediate between these two
is FASTA (3), which may be run in two modes offering either
greater speed (ktup = 2) or greater effectiveness (ktup = 1).
Pearson also considered different parameters for each of these 
programs. 

To test the methods, Pearson selected two representative 
proteins from each of 67 protein superfamilies defined by the 
PIR database (9). Each was used as a query to search the 
database, and the matched proteins were marked as being 
homologous or unrelated according to their membership of PIR
superfamilies. Pearson found that modern matrices and "ln-scaling"
of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked
slightly better than FASTA, which was in turn more effective
than BLAST.

Very large scale analyses of matrices have been performed 
(10), and Henikoff and Henikoff (11) also evaluated the 
effectiveness of BLAST and FASTA. Their test with BLAST
considered the ability to detect homologs above a predetermined
score but had no penalty for methods which also
reported large numbers of spurious matches. The Henikoffs
searched the SWISS-PROT database (12) and used PROSITE (13)
to define homologous families. Their results showed that the 
BLOSUM62 matrix (14) performed markedly better than the 
extrapolated PAM-series matrices (15), which previously had 
been popular. 

A crucial aspect of any assessment is the data that are used 
to test the ability of the program to find homologs. But in 
Pearson's and the Henikoffs' evaluations of sequence comparison,
the correct results were effectively unknown. This is
because the superfamilies in PIR and PROSITE are principally
created by using the same sequence comparison methods
which are being evaluated. Interdependency of data and
methods creates a "chicken and egg" problem, and means, for
example, that new methods would be penalized for correctly
identifying homologs missed by older programs. For instance,
immunoglobulin variable and constant domains are clearly
homologous, but PIR places them in different superfamilies.
The problem is widespread: each superfamily in PIR 48.00 with
a structural homolog is itself homologous to an average of 1.6
other PIR superfamilies (16).

To surmount these sorts of difficulties, Sander and Schnei- 
der (17) used protein structures to evaluate sequence com- 
parison. Rather than comparing different sequence compari- 
son algorithms, their work focused on determining a length- 
dependent threshold of percentage identity, above which all 
proteins would be of similar structure. A result of this analysis 
was the HSSP equation; it states that proteins with 25% identity
over 80 residues will have similar structures, whereas shorter 
alignments require higher identity. (Other studies also have 
used structures (18-20), but these focused on a small number 
of model proteins and were principally oriented toward eval- 
uating alignment accuracy rather than homology detection.) 

A general solution to the problem of scoring comes from 
statistical measures (i.e., E-values and P-values) based on the 
extreme value distribution (21). Extreme value scoring was 
implemented analytically in the BLAST program using the
Karlin and Altschul statistics (22, 23), and empirical approaches
have been recently added to FASTA and SSEARCH. In
addition to being heralded as a reliable means of recognizing
significantly similar proteins (24, 25), the mathematical tractability
of statistical scores "is a crucial feature of the BLAST
algorithm" (1). The validity of this scoring procedure has been
tested analytically and empirically (see ref. 2 and references in
ref. 24). However, all large empirical tests used random
sequences that may lack the subtle structure found within
biological sequences (26, 27) and obviously do not contain any
real homologs. Thus, although many researchers have sug- 
gested that statistical scores be used to rank matches (24, 25, 
28), there have been no large rigorous experiments on biolog- 
ical data to determine the degree to which such rankings are 
superior. 
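
For readers comparing the two kinds of statistical score mentioned above, the following Python sketch shows the commonly used conversion between an E-value (expected number of chance matches at least as good) and a P-value (probability of at least one such chance match), assuming the number of chance matches is Poisson distributed. This relationship is a standard result rather than something stated in the passage, and the function name is hypothetical; individual program versions may report P-values computed differently.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """P = 1 - exp(-E), assuming a Poisson number of chance matches.

    expm1 is used so that very small E-values do not lose precision.
    """
    return -math.expm1(-e_value)


for e in (1e-5, 0.01, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For small values the two scores are nearly equal, so the distinction matters mainly near the borderline of significance.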

A Database for Testing Homology Detection. Since the 
discovery that the structures of hemoglobin and myoglobin are 
very similar though their sequences are not (29), it has been 
apparent that comparing structures is a more powerful (if less 
convenient) way to recognize distant evolutionary relation- 
ships than comparing sequences. If two proteins show a high 
degree of similarity in their structural details and function, it 



is very probable that they have an evolutionary relationship 
though their sequence similarity may be low. 

The recent growth of protein structure information com- 
bined with the comprehensive evolutionary classification in 
the SCOP database (4, 5) have allowed us to overcome previous 
limitations. With these data, we can evaluate the performance 
of sequence comparison methods on real protein sequences 
whose relationships are known confidently. The SCOP database 
uses structural information to recognize distant homologs, the 
large majority of which can be determined unambiguously. 
These superfamilies, such as the globins or the immunoglobu- 
lins, would be recognized as related by the vast majority of the 
biological community despite the lack of high sequence sim- 
ilarity. 

From SCOP, we extracted the sequences of domains of
proteins in the Protein Data Bank (PDB) (30) and created two
databases. One (PDB90D-B) has domains that were all <90%
identical to any other, whereas the other (PDB40D-B) had those <40%
identical. The databases were created by first sorting all
protein domains in SCOP by their quality and making a list. The
highest quality domain was selected for inclusion in the
database and removed from the list. Also removed from the list
(and discarded) were all other domains above the threshold
level of identity to the selected domain. This process was
repeated until the list was empty. The PDB40D-B database
contains 1,323 domains, which have 9,044 ordered pairs of
distant relationships, or ~0.5% of the total 1,749,006 ordered
pairs. In PDB90D-B, the 2,079 domains have 53,988 relationships,
representing 1.2% of all pairs. Low complexity regions
of sequence can achieve spurious high scores, so these were
masked in both databases by processing with the SEG program
(27) using recommended parameters: 12 1.8 2.0. The databases
used in this paper are available from http://sss.stanford.edu/
sss/, and databases derived from the current version of SCOP
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/.
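
The database construction described above is a greedy filtering procedure, and a minimal Python sketch may help make the steps concrete. The function names, the quality scores, and the toy identity table below are hypothetical; the authors' actual quality measure and identity calculation are not specified in this passage.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple


def select_nonredundant(domains: Sequence[Tuple[str, float]],
                        identity: Callable[[str, str], float],
                        threshold: float) -> List[str]:
    """Greedy selection: repeatedly keep the highest quality domain and
    discard every remaining domain at or above `threshold` percent
    identity to it, until the list is empty."""
    remaining = sorted(domains, key=lambda d: d[1], reverse=True)
    selected = []
    while remaining:
        best_name, _quality = remaining[0]
        selected.append(best_name)
        remaining = [(name, q) for name, q in remaining[1:]
                     if identity(best_name, name) < threshold]
    return selected


def ordered_pairs(n: int) -> int:
    """Number of ordered (query, target) pairs among n distinct domains."""
    return n * (n - 1)


# Toy data: (domain name, quality score) and pairwise percent identities.
toy_domains = [("d1", 0.9), ("d2", 0.8), ("d3", 0.7), ("d4", 0.6)]
toy_table: Dict[FrozenSet[str], float] = {
    frozenset(("d1", "d2")): 95.0,
    frozenset(("d1", "d3")): 30.0,
    frozenset(("d3", "d4")): 55.0,
}


def toy_identity(a: str, b: str) -> float:
    # Pairs not listed in the table are treated as 10% identical.
    return toy_table.get(frozenset((a, b)), 10.0)


print(select_nonredundant(toy_domains, toy_identity, 40.0))
print(select_nonredundant(toy_domains, toy_identity, 90.0))
print(ordered_pairs(1323))
```

The last line reproduces the total of 1,749,006 ordered pairs quoted for the 1,323 domains of PDB40D-B, so 9,044 relationships correspond to roughly 0.5% of all pairs.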

Analyses from both databases were generally consistent, but
PDB40D-B focuses on distantly related proteins and reduces the
heavy overrepresentation in the PDB of a small number of
families (31, 32), whereas PDB90D-B (with more sequences)
improves evaluations of statistics. Except where noted otherwise,
the distant homolog results here are from PDB40D-B.
Although the precise numbers reported here are specific to the
structural domain databases used, we expect the trends to be
general.

Assessment Data and Procedure. Our assessment of se- 
quence comparison may be divided into four different major 
categories of tests. First, using just a single sequence compar- 
ison algorithm at a time, we evaluated the effectiveness of 
different scoring schemes. Second, we assessed the reliability 
of scoring procedures, including an evaluation of the validity 
of statistical scoring. Third, we compared sequence compari- 
son algorithms (using the optimal scoring scheme) to deter- 
mine their relative performance. Fourth, we examined the 
distribution of homologs and considered the power of pairwise 
sequence comparison to recognize them. All of the analyses 
used the databases of structurally identified homologs and a
new assessment criterion.

The analyses tested BLAST (1), version 1.4.9MP, and
WU-BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix (BLOSUM62)
were used for BLAST and WU-BLAST2.

The "Coverage vs. Error" Plot. To test a particular protocol
(comprising a program and scoring scheme), each sequence 
from the database was used as a query to search the database. 
This yielded ordered pairs of query and target sequences with 
associated scores, which were sorted, on the basis of their 
scores, from best to worst. The ideal method would have 
Fig. 1. Coverage vs. error plots of different scoring schemes for SSEARCH Smith-Waterman. (A) Analysis of PDB40D-B database. (B) Analysis
of PDB90D-B database. All of the proteins in the database were compared with each other using the SSEARCH program. The results of this single
set of comparisons were considered using five different scoring schemes and assessed. The graphs show the coverage and errors per query (EPQ)
for statistical scores, raw scores, and three measures using percentage identity. In the coverage vs. error plot, the x axis indicates the fraction of
all homologs in the database (known from structure) which have been detected. Precisely, it is the number of detected pairs of proteins with the
same fold divided by the total number of pairs from a common superfamily. PDB40D-B contains a total of 9,044 homologs, so a score of 10% indicates
identification of 904 relationships. The y axis reports the number of EPQ. Because there are 1,323 queries made in the PDB40D-B all-vs.-all
comparison, 13 errors corresponds to 0.01, or 1% EPQ. The y axis is presented on a log scale to show results over the widely varying degrees of
accuracy which may be desired. The scores that correspond to the levels of EPQ and coverage are shown in Fig. 4 and Table 1. The graph
demonstrates the trade-off between sensitivity and selectivity. As more homologs are found (moving to the right), more errors are made (moving
up). The ideal method would be in the lower right corner of the graph, which corresponds to identifying many evolutionary relationships without
selecting unrelated proteins. Three measures of percentage identity are plotted. Percentage identity within alignment is the degree of identity within
the aligned region of the proteins, without consideration of the alignment length. Percentage identity within both is the number of identical residues
in the aligned region as a percentage of the average length of the query and target proteins. The HSSP equation (17) is $H = 290.15\,l^{-0.562}$, where
$l$ is length, for $10 < l < 80$; $H > 100$ for $l < 10$; $H = 24.7$ for $l > 80$. The percentage identity HSSP-adjusted score is the percent identity within
the alignment minus H. Smith-Waterman raw scores and E-values were taken directly from the sequence comparison program.



perfect separation, with all of the homologs at the top of the 
list and unrelated proteins below. In practice, perfect separa- 
tion is impossible to achieve so instead one is interested in 
drawing a threshold above which there are the largest number 
of related pairs of sequences consistent with an acceptable 
error rate. 

Our procedure involved measuring the coverage and error 
for every threshold. Coverage was defined as the fraction of 
structurally determined homologs that have scores above the 
selected threshold; this reflects the sensitivity of a method. 
Errors per query (EPQ), an indicator of selectivity, is the 
number of nonhomologous pairs above the threshold divided 
by the number of queries. Graphs of these data, called
coverage vs. error plots, were devised to understand how
protocols compare at different levels of accuracy. These
graphs share effectively all of the beneficial features of
Receiver Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in
sequence comparison and the huge background of
nonhomologs.
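
As a rough illustration of the procedure above, the following Python sketch computes coverage and EPQ at every threshold from a list of scored pairs. The function names, the (score, is_homolog) input format, and the convention that a higher score is a better match are assumptions made for this example; they are not details of the software used in the assessment.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def coverage_vs_error(
    pairs: Iterable[Tuple[float, bool]],
    total_homologs: int,
    num_queries: int,
) -> List[Tuple[float, float, float]]:
    """Return (threshold, coverage, errors_per_query) for each distinct score.

    pairs holds (score, is_homolog) for every query/target pair. Higher
    scores are assumed to be better, so E-values or P-values should be
    negated before being passed in.
    """
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    curve = []
    true_hits = 0
    false_hits = 0
    # Pairs with equal scores cannot be separated by a threshold,
    # so each group of tied scores contributes a single curve point.
    for score, group in groupby(ranked, key=lambda p: p[0]):
        for _, is_homolog in group:
            if is_homolog:
                true_hits += 1
            else:
                false_hits += 1
        curve.append((score, true_hits / total_homologs, false_hits / num_queries))
    return curve


def coverage_at_epq(curve: List[Tuple[float, float, float]], max_epq: float) -> float:
    """Highest coverage reachable while EPQ stays at or below max_epq."""
    allowed = [coverage for _, coverage, epq in curve if epq <= max_epq]
    return max(allowed, default=0.0)
```

With the PDB40D-B figures quoted in the Fig. 1 caption, 13 nonhomologous pairs above the threshold over 1,323 queries give 13/1323, or roughly 0.01 EPQ.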

This assessment procedure is directly relevant to practical
sequence database searching, for it provides precisely the
information necessary to perform a reliable sequence database
search. The EPQ measure places a premium on score
consistency; that is, it requires scores to be comparable for different
queries. Consistency is an aspect which has been largely
ignored in previous tests but is essential for the straightforward
or automatic interpretation of sequence comparison results.
Further, it provides a clear indication of the confidence that
should be ascribed to each match. Indeed, the EPQ measure
should approximate the expectation value reported by
database searching programs, if the programs' estimates are
accurate.

Fig. 2. Unrelated proteins with high percentage identity.
Hemoglobin β-chain (PDB code 1hds chain b, ref. 38, Left) and cellulase E2
(PDB code 1tml, ref. 39, Right) have 39% identity over 64 residues, a
level which is often believed to be indicative of homology. Despite this
high degree of identity, their structures strongly suggest that these
proteins are not related. Appropriately, neither the raw alignment
score of 85 nor the E-value of 1.3 is significant. Proteins rendered by
RASMOL (40).

Fig. 3. Length and percentage identity of alignments of unrelated
proteins in PDB90D-B. Each pair of nonhomologous proteins found with
SSEARCH is plotted as a point whose position indicates the length and
the percentage identity within the alignment. Because alignment
length and percentage identity are quantized, many pairs of proteins
may have exactly the same alignment length and percentage identity.
The line shows the HSSP threshold (though it is intended to be applied
with a different matrix and parameters).

Fig. 4. Reliability of statistical scores in PDB90D-B. Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for SSEARCH and
FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers, and diverge
at higher values, as indicated by the lower bold line.) E-values from
SSEARCH and FASTA are shown to have good agreement with EPQ but
underestimate the significance slightly. BLAST and WU-BLAST2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB40D-B were similar to those for PDB90D-B
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given statistical
score.

The Performance of Scoring Schemes. All of the programs 
tested could provide three fundamental types of scores. The 
first score is the percentage identity, which may be computed 
in several ways based on either the length of the alignment or 
the lengths of the sequences. The second is a "raw" or 
"Smith-Waterman" score, which is the measure optimized by 
the Smith-Waterman algorithm and is computed by summing 
the substitution matrix scores for each position in the
alignment and subtracting gap penalties. In BLAST, a measure
related to this score is scaled into bits. Third is a statistical
score based on the extreme value distribution. These results
are summarized in Fig. 1.
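
To make the definition of the raw score concrete, here is a minimal Python sketch that scores an alignment that has already been computed. The toy match/mismatch matrix is hypothetical (it is not BLOSUM45 or BLOSUM62), and the gap convention, in which the first position of a gap costs 12 and each further position costs 1, is one assumed reading of the "-12/-1" penalties.

```python
from typing import Dict, Tuple

Matrix = Dict[Tuple[str, str], int]


def toy_matrix(alphabet: str, match: int = 5, mismatch: int = -2) -> Matrix:
    """Build a hypothetical substitution matrix, for illustration only."""
    return {(a, b): (match if a == b else mismatch) for a in alphabet for b in alphabet}


def raw_score(
    aligned_query: str,
    aligned_target: str,
    matrix: Matrix,
    gap_open: int = 12,
    gap_extend: int = 1,
) -> int:
    """Sum substitution scores over aligned columns and subtract gap penalties.

    Both strings must have equal length, with '-' marking gap positions.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    open_gap_side = None  # which sequence the current gap lies in, if any
    for a, b in zip(aligned_query, aligned_target):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if a == "-" or b == "-":
            side = "query" if a == "-" else "target"
            # Continuing a gap in the same sequence is cheaper than opening one.
            score -= gap_extend if open_gap_side == side else gap_open
            open_gap_side = side
        else:
            score += matrix[(a, b)]
            open_gap_side = None
    return score
```

For example, with `toy_matrix("ACDEFG")` the alignment of `ACDE--G` against `ACCEFFG` scores four matches, one mismatch, and one gap of length two.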

Sequence Identity. Though it has been long established that 
percentage identity is a poor measure (35), there is a common 
rule-of-thumb stating that 30% identity signifies homology. 
Moreover, publications have indicated that 25% identity can
be used as a threshold (17, 36). We find that these thresholds, 
originally derived years ago, are not supported by present 
results. As databases have grown, so have the possibilities for 
chance alignments with high identity; thus, the reported cutoffs 
lead to frequent errors. Fig. 2 shows one of the many pairs of 
proteins with very different structures that nonetheless have 
high levels of identity over considerable aligned regions. 
Despite the high identity, the raw and the statistical scores for 
such incorrect matches are typically not significant. The prin- 
cipal reasons percentage identity does so poorly seem to be 
that it ignores information about gaps and about the conser- 
vative or radical nature of residue substitutions. 

From the PDB90D-B analysis in Fig. 3, we learn that 30% 
identity is a reliable threshold for this database only for 
sequence alignments of at least 150 residues. Because one 
unrelated pair of proteins has 43.5% identity over 62 residues, 
it is probably necessary for alignments to be at least 70 residues 
in length before 40% is a reasonable threshold, for a database 
of this particular size and composition. 

At a given reliability, scores based on percentage identity 
detect just a fraction of the distant homologs found by 
statistical scoring. If one measures the percentage identity in 
the aligned regions without consideration of alignment length, 
then a negligible number of distant homologs are detected. 
Use of the HSSP equation improves the value of percentage
identity, but even this measure can find only 4% of all known 
homologs at 1% EPQ. In short, percentage identity discards 
most of the information measured in a sequence comparison. 
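
The three identity measures compared here, as defined in the Fig. 1 caption, can be sketched in Python as follows. The function names are illustrative. Two choices are assumptions of this sketch: the power-law form is applied at the boundary lengths 10 and 80, which the caption leaves unspecified, and infinity stands in for the caption's "H > 100" at lengths below 10.

```python
import math


def hssp_threshold(alignment_length: int) -> float:
    """HSSP threshold H (percent identity) for a given alignment length."""
    if alignment_length < 10:
        # The caption only states H > 100 here; infinity is a sentinel
        # meaning that no percentage identity can exceed the threshold.
        return math.inf
    if alignment_length > 80:
        return 24.7
    return 290.15 * alignment_length ** -0.562


def identity_within_alignment(identical: int, alignment_length: int) -> float:
    """Percent identity over the aligned region, ignoring its length."""
    return 100.0 * identical / alignment_length


def identity_within_both(identical: int, query_length: int, target_length: int) -> float:
    """Identical residues as a percentage of the mean sequence length."""
    return 100.0 * identical / ((query_length + target_length) / 2.0)


def hssp_adjusted_identity(identical: int, alignment_length: int) -> float:
    """Percent identity within the alignment minus the HSSP threshold H."""
    return identity_within_alignment(identical, alignment_length) - hssp_threshold(
        alignment_length
    )
```

As a usage example, 25 identical residues over a 64-residue alignment (about 39% identity, the level quoted for the pair in Fig. 2) still exceeds the HSSP threshold for that length, which is one way to see how a pair of unrelated proteins can pass a length-adjusted identity cutoff.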

Raw Scores. Smith-Waterman raw scores perform better 
than percentage identity (Fig. 1), but ln-scaling (7) provided no
notable benefit in our analysis. It is necessary to be very precise 
when using either raw or bit scores because a 20% change in 
cutoff score could yield a tenfold difference in EPQ. However, 
it is difficult to choose appropriate thresholds because the 
reliability of a bit score depends on the lengths of the proteins 
matched and the size of the database. Raw score thresholds 
also are affected by matrix and gap parameters. 

Statistical Scores. Statistical scores were introduced partly
to overcome the problems that arise from raw scores. This
scoring scheme provides the best discrimination between
homologous proteins and those which are unrelated. Most
likely, its power can be attributed to its incorporation of more
information than any other measure; it takes account of the
full substitution and gap data (like raw scores) but also has
details about the sequence lengths and composition and is
scaled appropriately.

[Figure 5 plot. Panel title: Sequence Comparison Algorithms (PDB90D-B); x axis: Coverage.]

Fig. 5. Coverage vs. error plots of different sequence comparison methods. Five different sequence comparison methods are evaluated, each
using statistical scores (E- or P-values). (A) PDB40D-B database. In this analysis, the best method is the slow SSEARCH, which finds 18% of relationships
at 1% EPQ. FASTA ktup = 1 and WU-BLAST2 are almost as good. (B) PDB90D-B database. The quick WU-BLAST2 program provides the best coverage
at 1% EPQ on this database, although at higher levels of error it becomes slightly worse than FASTA ktup = 1 and SSEARCH.

We find that statistical scores are not only powerful, but also
easy to interpret. SSEARCH and FASTA show close agreement
between statistical scores and actual number of errors per 
query (Fig. 4). The expectation value score gives a good, 
slightly conservative estimate of the chances of the two se- 
quences being found at random in a given query. Thus, an 
E-value of 0.01 indicates that roughly one pair of nonhomologs 
of this similarity should be found in every 100 different queries. 
Neither raw scores nor percentage identity can be interpreted 
in this way, and these results validate the suitability of the 
extreme value distribution for describing the scores from a 
database search. 
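
The interpretation of an E-value given above, and its relation to a P-value, can be illustrated with a short Python sketch. It assumes the standard Poisson relation P = 1 - exp(-E) between the expected number of chance matches and the probability of at least one; the function names are illustrative.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance match, given E expected matches."""
    # expm1 keeps precision for the small E-values that matter in practice.
    return -math.expm1(-e_value)


def expected_chance_matches(e_value: float, num_queries: int) -> float:
    """Expected number of nonhomologous matches at this E-value over many queries."""
    return e_value * num_queries
```

At an E-value of 0.01 the P-value is about 0.00995, so the two scores are nearly equal for small values, and about one chance match is expected per 100 queries. At an E-value of 5 the P-value is about 0.993, which shows why P-values and EPQ separate at higher values.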

The P-values from BLAST also should be directly
interpretable but were found to overstate significance by more than two
orders of magnitude for 1% EPQ for this database.
Nonetheless, these results strongly suggest that the analytic theory is
fundamentally appropriate. WU-BLAST2 scores were more
reliable than those from BLAST, but also exaggerate expected
confidence by more than an order of magnitude at 1% EPQ.

Overall Detection of Homologs and Comparison of
Algorithms. The results in Fig. 5A and Table 1 show that pairwise
sequence comparison is capable of identifying only a small
fraction of the homologous pairs of sequences in PDB40D-B.
Even SSEARCH with E-values, the best protocol tested, could
find only 18% of all relationships at a 1% EPQ. BLAST, which
identifies 15%, was the worst performer, whereas FASTA
ktup = 1 is nearly as effective as SSEARCH. FASTA ktup = 2 and
WU-BLAST2 are intermediate in their ability to detect
homologs. Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. SSEARCH is 25 times slower than BLAST and 6.5 times
slower than FASTA ktup = 1. WU-BLAST2 is slightly faster than
FASTA ktup = 2, but the latter has more interpretable scores.

In PDB90D-B, where there are many close relationships, the
best method can identify only 38% of structurally known
homologs (Fig. 5B). The method which finds that many
relationships is WU-BLAST2. Consequently, we infer that the
differences between the FASTA ktup = 1, SSEARCH, and WU-BLAST2
programs are unlikely to be significant when compared with
variation in database composition and scoring reliability.

Fig. 6 helps to explain why most distant homologs cannot be
found by sequence comparison: a great many such
relationships have no more sequence identity than would be expected
by chance. SSEARCH with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there
are 30 pairs of homologous proteins that do not have
significant E-values, but 26 of these involve sequences with <50
residues. Of sequences having 25-30% identity, 75% are
identified by SSEARCH E-values. However, although the
number of homologs grows at lower levels of identity, the detection
falls off sharply: only 40% of homologs with 20-25% identity
are detected and only 10% of those with 15-20% can be found.
These results show that statistical scores can find related
proteins whose identity is remarkably low; however, the power
of the method is restricted by the great divergence of many
protein sequences.

Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars
show the distribution of homologous pairs in PDB40D-B according to their
identity (using the measure of identity in both). Filled regions indicate
the number of these pairs found by the best database searching method
(SSEARCH with E-values) at 1% EPQ. The PDB40D-B database contains
proteins with <40% identity, and as shown on this graph, most
structurally identified homologs in the database have diverged
extremely far in sequence and have <20% identity. Note that the
alignments may be inaccurate, especially at low levels of identity. Filled
regions show that SSEARCH can identify most relationships that have
25% or more identity, but its detection wanes sharply below 25%.
Consequently, the great sequence divergence of most structurally
identified evolutionary relationships effectively defeats the ability of
pairwise sequence comparison to detect them.

After completion of this work, a new version of pairwise
BLAST was released: BLASTGP (37). It supports gapped
alignments, like WU-BLAST2, and dispenses with sum statistics. Our
initial tests on BLASTGP using default parameters show that its
E-values are reliable and that its overall detection of homologs
was substantially better than that of ungapped BLAST, but not
quite equal to that of WU-BLAST2.

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27
and references therein) suggests that the most effective se- 
quence searches are made by (i) using a large current database 
in which the protein sequences have been complexity masked 
and (ii) using statistical scores to interpret the results. Our 
experiments fully support this view. 

Our results also suggest two further points. First, the
E-values reported by FASTA and SSEARCH give fairly accurate
estimates of the significance of each match, but the P-values
provided by BLAST and WU-BLAST2 underestimate the true
extent of errors. Second, SSEARCH, WU-BLAST2, and FASTA
ktup = 1 perform best, though BLAST and FASTA ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.

Table 1. Summary of sequence comparison methods with PDB40D-B

Method                                 Relative time*   1% EPQ cutoff      Coverage at 1% EPQ (%)
SSEARCH % identity: within alignment   25.5             >70%               <0.1
SSEARCH % identity: within both        25.5             34%                3.0
SSEARCH % identity: HSSP-scaled        25.5             35% (HSSP + 9.8)   4.0
SSEARCH Smith-Waterman raw scores      25.5             142                10.5
SSEARCH E-values                       25.5             0.03               18.4
FASTA ktup = 1 E-values                3.9              0.03               17.9
FASTA ktup = 2 E-values                1.4              0.03               16.7
WU-BLAST2 P-values                     1.1              0.003              17.5
BLAST P-values                         1.0              0.00016            14.8

*Times are from large database searches with genome proteins.
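
One way to read the statistical-score rows of Table 1 is to compare each reported cutoff with the error rate actually observed at that cutoff (1% EPQ). The Python sketch below does this arithmetic for PDB40D-B using only the cutoffs listed in the table. The "overstatement factor" name and the dictionary layout are assumptions of this example, and P-values are treated as interchangeable with E-values because the two nearly coincide at such small values.

```python
# Reported statistical score at which each method reaches 1% EPQ (Table 1).
TABLE_1_CUTOFFS = {
    "SSEARCH E-values": 0.03,
    "FASTA ktup = 1 E-values": 0.03,
    "FASTA ktup = 2 E-values": 0.03,
    "WU-BLAST2 P-values": 0.003,
    "BLAST P-values": 0.00016,
}


def overstatement_factor(reported_score: float, observed_epq: float = 0.01) -> float:
    """Ratio of observed errors per query to the reported statistical score.

    Values above 1 mean the score claims fewer errors than were observed
    (overconfident); values below 1 mean the score is conservative.
    """
    return observed_epq / reported_score


def summarize(cutoffs: dict) -> dict:
    """Overstatement factor for every method in the table."""
    return {method: overstatement_factor(score) for method, score in cutoffs.items()}
```

On these figures the SSEARCH and FASTA E-values come out conservative by a factor of about three, whereas the BLAST P-values overstate significance by a factor of roughly 60.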

The homologous proteins that are found by sequence com- 
parison can be distinguished with high reliability from the huge 
number of unrelated pairs. However, even the best database 
searching procedures tested fail to find the large majority of 
distant evolutionary relationships at an acceptable error rate. 
Thus, if the procedures assessed here fail to find a reliable 
match, it does not imply that the sequence is unique; rather, it 
indicates that any relatives it might have are distant ones.** 



** Additional and updated information about this work, including
supplementary figures, may be found at http://sss.stanford.edu/sss/.